3. Logistic Regression
Everyone knows the sign function:

f(x) = 1   (x > 0)
     = 0   (x = 0)
     = -1  (x < 0)

The following function is a variant of the sign function:

g(x) = 1   (x > 0)
     = 0.5 (x = 0)
     = 0   (x < 0)

The function g(z) = 1 / (1 + e^(-z)) matches this behavior in the limit, and unlike the step function it is continuous. We call it the logistic (sigmoid) function. Starting from the linear model y = wx, let z = wx, so g(z) = 1 / (1 + e^(-wx)). Then when z = 0, g(z) = 0.5; when z > 0, g(z) > 0.5 and tends to 1; when z < 0, g(z) < 0.5 and tends to 0, which is exactly what we need for binary classification. sklearn.linear_model implements logistic regression through the LogisticRegression class.
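As a quick numeric check, here is a minimal sketch of the logistic function (a standalone helper written for illustration, not part of the sklearn example below):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 exactly
print(sigmoid(5.0))   # close to 1
print(sigmoid(-5.0))  # close to 0
```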
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Analyze a sklearn dataset with logistic regression
def useing_sklearn_datasets_for_LogisticRegression():
    # Load and split the data
    cancer = datasets.load_breast_cancer()
    X = cancer.data
    y = cancer.target
    print("X shape={}, positive samples: {}, negative samples: {}".format(
        X.shape, y[y == 1].shape[0], y[y == 0].shape[0]))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # Train the model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    # Check the model scores
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print("breast cancer training set score: {trs:.2f}, test set score: {tss:.2f}".format(
        trs=train_score, tss=test_score))
Output
X shape=(569, 30), positive samples: 357, negative samples: 212
breast cancer training set score: 0.95, test set score: 0.95
This result is quite satisfactory.
4. Ridge Regression
Ridge regression (also known as Tikhonov regularization) is a biased-estimation regression method designed for collinear data. It is essentially an improved least squares estimate: by giving up the unbiasedness of ordinary least squares, it trades some information and precision for regression coefficients that are more realistic and more reliable, and it fits ill-conditioned data better than plain least squares. In effect, ridge regression sacrifices some training set score to gain test set score. It is implemented by the Ridge class.
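For intuition, ridge regression has the closed-form solution w = (XᵀX + αI)⁻¹Xᵀy. A minimal sketch on synthetic data (the data and variable names here are illustrative, not from the diabetes example below) shows how the penalty shrinks the coefficients relative to ordinary least squares:

```python
import numpy as np

# Synthetic regression data (assumed setup for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

alpha = 1.0
# Ridge estimate: w = (X^T X + alpha * I)^(-1) X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
# Ordinary least squares for comparison: w = (X^T X)^(-1) X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("ridge:", w_ridge)
print("OLS:  ", w_ols)
# For any alpha > 0, the ridge coefficient vector has a smaller L2 norm than OLS
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # True
```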
from sklearn import datasets
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Analyze with ridge regression
def useing_sklearn_datasets_for_Ridge():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lr = LinearRegression().fit(X_train, y_train)
    ridge = Ridge().fit(X_train, y_train)
    print('alpha=1, diabetes training set score: {:.2f}'.format(ridge.score(X_train, y_train)))
    print('alpha=1, diabetes test set score: {:.2f}'.format(ridge.score(X_test, y_test)))
    ridge10 = Ridge(alpha=10).fit(X_train, y_train)
    print('alpha=10, diabetes training set score: {:.2f}'.format(ridge10.score(X_train, y_train)))
    print('alpha=10, diabetes test set score: {:.2f}'.format(ridge10.score(X_test, y_test)))
    ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
    print('alpha=0.1, diabetes training set score: {:.2f}'.format(ridge01.score(X_train, y_train)))
    print('alpha=0.1, diabetes test set score: {:.2f}'.format(ridge01.score(X_test, y_test)))
Output
alpha=1, diabetes training set score: 0.43
alpha=1, diabetes test set score: 0.43
alpha=10, diabetes training set score: 0.14
alpha=10, diabetes test set score: 0.16
alpha=0.1, diabetes training set score: 0.52
alpha=0.1, diabetes test set score: 0.47
The table below compares the training and test set scores for each alpha.
alpha | training set score | test set score |
---|---|---|
1 | 0.43 | 0.43 |
10 | 0.14 | 0.16 |
0.1 | 0.52 | 0.47 |
linear regression | 0.54 | 0.45 |
import matplotlib.pyplot as plt

# Plot the coefficients of each model (ridge, ridge10, ridge01 and lr
# are the fitted models from the function above)
plt.plot(ridge.coef_, 's', label='Ridge alpha=1')
plt.plot(ridge10.coef_, '^', label='Ridge alpha=10')
plt.plot(ridge01.coef_, 'v', label='Ridge alpha=0.1')
plt.plot(lr.coef_, 'o', label='Linear Regression')
plt.xlabel('coefficient index')
plt.ylabel('coefficient magnitude')
plt.hlines(0, 0, len(lr.coef_))
plt.show()
- alpha=10: coefficients stay close to 0 (^, orange up-triangles)
- alpha=1: coefficients vary more (s, blue squares)
- alpha=0.1: coefficients vary even more, approaching the linear model (v, green down-triangles)
- linear regression: coefficients vary the most, even beyond the plot range (o, red dots)
The larger alpha is, the more strongly the coefficients are shrunk toward zero.
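This shrinkage is easy to verify numerically. The following sketch (synthetic data, illustrative setup, not the diabetes example) prints the L2 norm of the ridge coefficient vector for increasing alpha:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=80)

norms = []
for alpha in (0.1, 1, 10):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(np.linalg.norm(coef))
    print("alpha={}: ||coef||_2 = {:.3f}".format(alpha, norms[-1]))
# The norms decrease monotonically as alpha increases
```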
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, KFold

# Define a function to plot a learning curve
def plot_learning_curve(est, X, y):
    training_set_size, train_scores, test_scores = learning_curve(
        est, X, y, train_sizes=np.linspace(.1, 1, 20),
        cv=KFold(20, shuffle=True, random_state=1))
    estimator_name = est.__class__.__name__
    line = plt.plot(training_set_size, train_scores.mean(axis=1), '--',
                    label="training " + estimator_name)
    plt.plot(training_set_size, test_scores.mean(axis=1), '-',
             label="test " + estimator_name, c=line[0].get_color())
    plt.xlabel('training set size')
    plt.ylabel('Score')
    plt.ylim(0, 1.1)
plot_learning_curve(Ridge(alpha=1), X, y)
plot_learning_curve(LinearRegression(), X, y)
plt.legend(loc=(0, 1.05), ncol=2, fontsize=11)
plt.show()
The results above show that:
- the training set score is higher than the test set score;
- the ridge regression test set score is lower than the linear regression test set score;
- the ridge regression test set score stays close to its training set score;
- when the training set is small, neither linear model learns much;
- as the training set grows, the two models' scores converge.
5. Lasso Regression
Lasso regression (least absolute shrinkage and selection operator) is broadly similar to ridge regression. In practice, ridge regression is usually the first choice between the two. However, if there are very many features and only some of them are truly important, lasso's built-in feature selection may make it the better choice. It is implemented by the Lasso class.
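The selection effect is easy to see on synthetic data. This sketch (an illustrative setup, not the diabetes example below) fits Lasso and Ridge to a problem where only two of ten features carry signal; Lasso typically drives the irrelevant coefficients exactly to zero, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of ten features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
# Lasso zeroes out most uninformative coefficients; ridge keeps all ten nonzero
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))
print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))
```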
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Analyze with lasso regression
def useing_sklearn_datasets_for_Lasso():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lasso = Lasso().fit(X_train, y_train)
    print('alpha=1, diabetes training set score: {:.2f}'.format(lasso.score(X_train, y_train)))
    print('alpha=1, diabetes test set score: {:.2f}'.format(lasso.score(X_test, y_test)))
    print('alpha=1, number of features used by lasso: {}'.format(np.sum(lasso.coef_ != 0)))
    lasso01 = Lasso(alpha=0.1, max_iter=100000).fit(X_train, y_train)
    print('alpha=0.1, max_iter=100000, diabetes training set score: {:.2f}'.format(lasso01.score(X_train, y_train)))
    print('alpha=0.1, max_iter=100000, diabetes test set score: {:.2f}'.format(lasso01.score(X_test, y_test)))
    print('alpha=0.1, number of features used by lasso: {}'.format(np.sum(lasso01.coef_ != 0)))
    lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
    print('alpha=0.0001, max_iter=100000, diabetes training set score: {:.2f}'.format(lasso00001.score(X_train, y_train)))
    print('alpha=0.0001, max_iter=100000, diabetes test set score: {:.2f}'.format(lasso00001.score(X_test, y_test)))
    print('alpha=0.0001, number of features used by lasso: {}'.format(np.sum(lasso00001.coef_ != 0)))
Output
alpha=1, diabetes training set score: 0.37
alpha=1, diabetes test set score: 0.38
alpha=1, number of features used by lasso: 3
alpha=0.1, max_iter=100000, diabetes training set score: 0.52
alpha=0.1, max_iter=100000, diabetes test set score: 0.48
alpha=0.1, number of features used by lasso: 7
alpha=0.0001, max_iter=100000, diabetes training set score: 0.53
alpha=0.0001, max_iter=100000, diabetes test set score: 0.45
alpha=0.0001, number of features used by lasso: 10
- alpha=1: only 3 features are used and the scores are low; the model underfits.
- alpha=0.1: lowering alpha raises the scores and increases the number of features used to 7.
- alpha=0.0001: the test set score (0.45) is lower than at alpha=0.1 (0.48), which shows that lowering alpha too far makes the model more prone to overfitting.
Comparing ridge regression with lasso regression:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

def Ridge_VS_Lasso():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lasso = Lasso(alpha=1, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso.coef_, 's', label='lasso alpha=1')
    lasso01 = Lasso(alpha=0.1, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso01.coef_, '^', label='lasso alpha=0.1')
    lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso00001.coef_, 'v', label='lasso alpha=0.0001')
    ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
    plt.plot(ridge01.coef_, 'o', label='ridge alpha=0.1')
    plt.legend(ncol=2, loc=(0, 1.05))
    plt.ylim(-1000, 750)
    plt.xlabel('Coefficient index')
    plt.ylabel('Coefficient magnitude')
    plt.show()
The results above show that:
- alpha=1: most coefficients are near 0.
- alpha=0.1: most coefficients are still near 0, but fewer of them are exactly 0 than at alpha=1.
- alpha=0.0001: the model is barely regularized and most coefficients are nonzero.
- ridge regression at alpha=0.1 behaves much like lasso.
In short: when the data has many features and only a few of them really matter, use lasso regression; otherwise use ridge regression.
6. Testing All Linear Models on sklearn Datasets
Create a file machinelearn_data_model.py.
# coding:utf-8
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR
import statsmodels.api as sm

class data_for_model:
    def machine_learn(data, model):
        # Select the dataset
        if data == "iris":
            mydata = datasets.load_iris()
        elif data == "wine":
            mydata = datasets.load_wine()
        elif data == "breast_cancer":
            mydata = datasets.load_breast_cancer()
        elif data == "diabetes":
            mydata = datasets.load_diabetes()
        elif data == "boston":
            mydata = datasets.load_boston()
        elif data != "two_moon":
            return "Invalid dataset; choose one of: iris, wine, breast_cancer, diabetes, boston, two_moon"
        if data == "two_moon":
            X, y = datasets.make_moons(n_samples=200, noise=0.05, random_state=0)
        elif model in ("DecisionTreeClassifier", "RandomForestClassifier"):
            X, y = mydata.data[:, :2], mydata.target
        else:
            X, y = mydata.data, mydata.target
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        # Select and fit the model
        if model == "LinearRegression":
            md = LinearRegression().fit(X_train, y_train)
        elif model == "LogisticRegression":
            if data == "boston":
                y_train = y_train.astype('int')
                y_test = y_test.astype('int')
            md = LogisticRegression().fit(X_train, y_train)
        elif model == "Ridge":
            md = Ridge(alpha=0.1).fit(X_train, y_train)
        elif model == "Lasso":
            md = Lasso(alpha=0.0001, max_iter=10000000).fit(X_train, y_train)
        elif model == "SVM":
            md = LinearSVR(C=2).fit(X_train, y_train)
        elif model == "sm":
            md = sm.OLS(y, X).fit()
        else:
            return "Invalid model; choose one of: LinearRegression, LogisticRegression, Ridge, Lasso, SVM, sm"
        # Report the scores
        if model == "sm":
            print("results.params(diabetes):\n", md.params,
                  "\nresults.summary(diabetes):\n", md.summary())
        else:
            print("model:", model, "data:", data,
                  "training set score: {:.2%}".format(md.score(X_train, y_train)))
            print("model:", model, "data:", data,
                  "test set score: {:.2%}".format(md.score(X_test, y_test)))
Notes on the implementation:
- With the Boston housing data, LogisticRegression requires the target y to be of type int, hence the conversion;
- The Ridge model uses alpha=0.1;
- The Lasso model uses alpha=0.0001 with a maximum of 10,000,000 iterations.
This lets us run a quantitative analysis for any chosen model on any chosen dataset:
from machinelearn_data_model import data_for_model

def linear_for_all_data_and_model():
    datas = ["iris", "wine", "breast_cancer", "diabetes", "boston", "two_moon"]
    models = ["LinearRegression", "LogisticRegression", "Ridge", "Lasso", "SVM", "sm"]
    for data in datas:
        for model in models:
            data_for_model.machine_learn(data, model)
We compare the test results:
data | model | training set | test set |
---|---|---|---|
iris | LinearRegression | 92.7% | 93.7% |
iris | LogisticRegression | 96.4% | 100% |
iris | Ridge | 93.1% | 92.8% |
iris | Lasso | 92.8% | 93.1% |
iris | StatsModels OLS | 0.972 | |
wine | LinearRegression | 90.6% | 85.1% |
wine | LogisticRegression | 97.7% | 95.6% |
wine | Ridge | 90.2% | 86.8% |
wine | Lasso | 91.0% | 85.2% |
wine | StatsModels OLS | 0.948 | |
breast cancer | LinearRegression | 79.1% | 68.9% |
breast cancer | LogisticRegression | 95.3% | 93.0% |
breast cancer | Ridge | 75.7% | 74.5% |
breast cancer | Lasso | 77.6% | 71.4% |
breast cancer | StatsModels OLS | 0.908 | |
diabetes | LinearRegression | 52.5% | 47.9% |
diabetes | LogisticRegression | 02.7% | 00.0% |
diabetes | Ridge | 51.5% | 49.2% |
diabetes | Lasso | 51.5% | 50.2% |
diabetes | StatsModels OLS | 0.106 | |
Boston housing | LinearRegression | 74.5% | 70.9% |
Boston housing | LogisticRegression | 20.8% | 11.0% |
Boston housing | Ridge | 76.0% | 62.7% |
Boston housing | Lasso | 73.5% | 74.5% |
Boston housing | StatsModels OLS | 0.959 | |
two moons | LinearRegression | 66.9% | 63.0% |
two moons | LogisticRegression | 89.3% | 86.0% |
two moons | Ridge | 66.3% | 64.3% |
two moons | Lasso | 65.3% | 65.2% |
two moons | StatsModels OLS | 0.501 | |
Observations:
- The iris data performs well under all models.
- The wine data performs well under all models, though slightly worse than iris.
- The breast cancer data performs well only under logistic regression and OLS.
- The diabetes data performs poorly under all models.
- The Boston housing data performs well only under OLS and poorly elsewhere, although under logistic regression it still does slightly better than the diabetes data.
- The two moons data performs best under LogisticRegression and not very well under the others.
Summary table (in the original, cells were color-coded: green = good, red = poor, purple = average):
data | LinearRegression | LogisticRegression | Ridge | Lasso | OLS |
---|---|---|---|---|---|
iris | | | | | |
wine | | | | | |
breast cancer | | | | | |
diabetes | | | | | |
Boston housing | | | | | |
two moons | | | | | |
Finally, we bundle in the KNN algorithm as well, modifying machinelearn_data_model.py as follows:

…
        elif model == "KNeighborsClassifier":
            if data == "boston":
                y_train = y_train.astype('int')
                y_test = y_test.astype('int')
            md = KNeighborsClassifier().fit(X_train, y_train)
        else:
            return "Invalid model; choose one of: LinearRegression, LogisticRegression, Ridge, Lasso, SVM, sm, KNeighborsClassifier"
…
Call the test program:

from machinelearn_data_model import data_for_model

def KNN_for_all_data_and_model():
    datas = ["iris", "wine", "breast_cancer", "diabetes", "boston", "two_moon"]
    models = ["KNeighborsClassifier"]
    for data in datas:
        for model in models:
            data_for_model.machine_learn(data, model)
The test results:
data | model | training set | test set |
---|---|---|---|
iris | KNeighborsClassifier | 95.5% | 97.4% |
wine | KNeighborsClassifier | 76.7% | 68.9% |
breast cancer | KNeighborsClassifier | 95.3% | 90.2% |
diabetes | KNeighborsClassifier | 19.6% | 00.0% |
Boston housing | KNeighborsClassifier | 36.4% | 04.7% |
two moons | KNeighborsClassifier | 100.00% | 100.00% |
This shows that KNeighborsClassifier is effective on the iris, breast cancer and two moons data.
data | KNeighborsClassifier |
---|---|
iris | |
wine | |
breast cancer | |
diabetes | |
Boston housing | |
two moons | |