Machine Learning Testing Notes (12): Linear Regression Methods (Part 2)

2021-01-04 11:35:38

3. Logistic Regression

Recall the sign function:

sgn(x) = 1    (x > 0)
       = 0    (x = 0)
       = -1   (x < 0)

The function below is a variant of the sign function that maps onto {0, 0.5, 1}:

g(x) = 1     (x > 0)
     = 0.5   (x = 0)
     = 0     (x < 0)

The logistic function

σ(z) = 1 / (1 + e^(-z))

matches this behavior in the limit while being a continuous function; its curve is the familiar S shape (the original post showed the curve here). Starting from the linear model y = wx, let z = wx and pass it through σ. Then when z = 0, σ(z) = 0.5; when z > 0, σ(z) > 0.5 and tends to 1; when z < 0, σ(z) < 0.5 and tends to 0. Thresholding at 0.5 therefore yields a binary classifier. sklearn.linear_model implements this via the LogisticRegression class.
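The threshold behavior of the logistic function can be sketched numerically (the `sigmoid` helper below is my own illustration, not code from the post):

```python
import numpy as np

def sigmoid(z):
    # logistic function: sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # exactly 0.5 at z = 0
print(sigmoid(10.0))   # close to 1 for large positive z
print(sigmoid(-10.0))  # close to 0 for large negative z
```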

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# analyze an sklearn dataset with logistic regression
def useing_sklearn_datasets_for_LogisticRegression():
    # load and split the data
    cancer = datasets.load_breast_cancer()
    X = cancer.data
    y = cancer.target
    print("X shape={}, positive samples: {}, negative samples: {}".format(
        X.shape, y[y == 1].shape[0], y[y == 0].shape[0]))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # train the model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    # score the model
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print("breast cancer train score: {trs:.2f}, test score: {tss:.2f}".format(
        trs=train_score, tss=test_score))

Output

X shape=(569, 30), positive samples: 357, negative samples: 212
breast cancer train score: 0.95, test score: 0.95

This result is quite satisfactory.
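The 0.5 cutoff described above is exactly what LogisticRegression.predict applies to the probabilities from predict_proba. A small check (a sketch of my own, not from the post; max_iter=5000 is only there to ensure convergence on this unscaled data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

proba = model.predict_proba(X)[:, 1]       # P(y = 1) for each sample
manual = (proba > 0.5).astype(int)         # apply the 0.5 threshold by hand

# the manual threshold reproduces model.predict
print((manual == model.predict(X)).all())
```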

4. Ridge Regression

Ridge regression (also known as Tikhonov regularization) is a biased-estimation method designed for data with collinear features. It is essentially a modified least-squares estimate: by giving up the unbiasedness of ordinary least squares, it trades some information and precision for regression coefficients that are more stable and realistic, and it handles ill-conditioned data better than plain least squares. Concretely, it minimizes ||y − Xw||² + α·||w||², so a larger α shrinks the coefficients harder. Ridge sacrifices some training-set score in exchange for a (hopefully) better test-set score. sklearn implements it in the Ridge class.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge

# analyze with ridge regression
def useing_sklearn_datasets_for_Ridge():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lr = LinearRegression().fit(X_train, y_train)
    ridge = Ridge().fit(X_train, y_train)
    print('alpha=1, diabetes train score: {:.2f}'.format(ridge.score(X_train, y_train)))
    print('alpha=1, diabetes test score: {:.2f}'.format(ridge.score(X_test, y_test)))
    ridge10 = Ridge(alpha=10).fit(X_train, y_train)
    print('alpha=10, diabetes train score: {:.2f}'.format(ridge10.score(X_train, y_train)))
    print('alpha=10, diabetes test score: {:.2f}'.format(ridge10.score(X_test, y_test)))
    ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
    print('alpha=0.1, diabetes train score: {:.2f}'.format(ridge01.score(X_train, y_train)))
    print('alpha=0.1, diabetes test score: {:.2f}'.format(ridge01.score(X_test, y_test)))

Output

alpha=1, diabetes train score: 0.43
alpha=1, diabetes test score: 0.43
alpha=10, diabetes train score: 0.14
alpha=10, diabetes test score: 0.16
alpha=0.1, diabetes train score: 0.52
alpha=0.1, diabetes test score: 0.47

The table below compares the train and test scores at each alpha (the last row is plain linear regression for reference):

alpha              Train score   Test score
1                  0.43          0.43
10                 0.14          0.16
0.1                0.52          0.47
LinearRegression   0.54          0.45

plt.plot(ridge.coef_,'s',label='Ridge alpha=1')
plt.plot(ridge10.coef_,'^',label='Ridge alpha=10')
plt.plot(ridge01.coef_,'v',label='Ridge alpha=0.1')
plt.plot(lr.coef_,'o',label='Linear Regression')
plt.xlabel('coefficient index')
plt.ylabel('coefficient magnitude')
plt.hlines(0,0,len(lr.coef_))
plt.show()
  • alpha=10: coefficients vary in a narrow band around 0 (^, orange)
  • alpha=1: the variation grows (s, blue)
  • alpha=0.1: the variation grows further, approaching linear regression (v, green)
  • linear regression: the variation is largest and runs off the chart (o, red)

The larger alpha is, the more strongly the coefficients are shrunk toward zero.
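This shrinkage can be checked directly from ridge's closed-form solution w = (XᵀX + αI)⁻¹Xᵀy; a quick numpy sketch on synthetic data (my own illustration, not from the post):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([3.0, -2.0, 1.5, 0.5, -1.0]) + rng.randn(100) * 0.1

def ridge_coef(X, y, alpha):
    # closed-form ridge solution: w = (X^T X + alpha*I)^(-1) X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms = [np.linalg.norm(ridge_coef(X, y, a)) for a in (0.1, 1.0, 10.0, 100.0)]
print(norms)  # the coefficient norm falls as alpha grows
```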

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, KFold

# define a function that plots a learning curve
def plot_learning_curve(est, X, y):
    training_set_size, train_scores, test_scores = learning_curve(
        est, X, y, train_sizes=np.linspace(.1, 1, 20),
        cv=KFold(20, shuffle=True, random_state=1))
    estimator_name = est.__class__.__name__
    line = plt.plot(training_set_size, train_scores.mean(axis=1), '--',
                    label="training " + estimator_name)
    plt.plot(training_set_size, test_scores.mean(axis=1), '-',
             label="test " + estimator_name, c=line[0].get_color())
    plt.xlabel('training set size')
    plt.ylabel('Score')
    plt.ylim(0, 1.1)
plot_learning_curve(Ridge(alpha=1), X, y)
plot_learning_curve(LinearRegression(), X, y)
plt.legend(loc=(0, 1.05), ncol=2, fontsize=11)
plt.show()

The results show:

  1. The training score is higher than the test score;
  2. The ridge test score is lower than the linear regression test score;
  3. The ridge test score is close to its training score;
  4. When the training set is small, neither linear model learns much;
  5. As the training set grows, the two models' scores converge.
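The same learning_curve call can be exercised without plotting; here is a compact sketch on synthetic data (the dataset and parameters are my own choices, not from the post):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve, KFold

X, y = make_regression(n_samples=300, n_features=10, noise=20, random_state=1)
sizes, train_scores, test_scores = learning_curve(
    Ridge(alpha=1), X, y,
    train_sizes=np.linspace(.1, 1, 5),
    cv=KFold(5, shuffle=True, random_state=1))

# one mean score per training-set size
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
print(sizes)       # the actual training-set sizes used
print(train_mean)  # training scores stay high throughout
print(test_mean)   # test scores rise as the training set grows
```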

5. Lasso Regression

Lasso regression is broadly similar to ridge regression, but it uses an L1 penalty: it minimizes (1/(2n))·||y − Xw||² + α·||w||₁, which drives some coefficients exactly to zero and thus performs feature selection. In practice, of the two, ridge is usually the first choice; but if there are many features and only some of them really matter, lasso's selectivity may make it the better option. sklearn implements it in the Lasso class.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

# analyze with lasso regression
def useing_sklearn_datasets_for_Lasso():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lasso = Lasso().fit(X_train, y_train)
    print('alpha=1, diabetes train score: {:.2f}'.format(lasso.score(X_train, y_train)))
    print('alpha=1, diabetes test score: {:.2f}'.format(lasso.score(X_test, y_test)))
    print('alpha=1, number of features used: {}'.format(np.sum(lasso.coef_ != 0)))
    lasso01 = Lasso(alpha=0.1, max_iter=100000).fit(X_train, y_train)
    print('alpha=0.1, max_iter=100000, diabetes train score: {:.2f}'.format(lasso01.score(X_train, y_train)))
    print('alpha=0.1, max_iter=100000, diabetes test score: {:.2f}'.format(lasso01.score(X_test, y_test)))
    print('alpha=0.1, number of features used: {}'.format(np.sum(lasso01.coef_ != 0)))
    lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
    print('alpha=0.0001, max_iter=100000, diabetes train score: {:.2f}'.format(lasso00001.score(X_train, y_train)))
    print('alpha=0.0001, max_iter=100000, diabetes test score: {:.2f}'.format(lasso00001.score(X_test, y_test)))
    print('alpha=0.0001, number of features used: {}'.format(np.sum(lasso00001.coef_ != 0)))

Output

alpha=1, diabetes train score: 0.37
alpha=1, diabetes test score: 0.38
alpha=1, number of features used: 3
alpha=0.1, max_iter=100000, diabetes train score: 0.52
alpha=0.1, max_iter=100000, diabetes test score: 0.48
alpha=0.1, number of features used: 7
alpha=0.0001, max_iter=100000, diabetes train score: 0.53
alpha=0.0001, max_iter=100000, diabetes test score: 0.45
alpha=0.0001, number of features used: 10
  • alpha=1: only 3 features are used and the scores are low; the model underfits.
  • alpha=0.1: lowering alpha raises the scores, and the feature count rises to 7.
  • alpha=0.0001: the test score (0.45) is below the alpha=0.1 test score (0.48), showing that lowering alpha too far pushes the model toward overfitting.

Comparing ridge regression with lasso regression:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso

def Ridge_VS_Lasso():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lasso = Lasso(alpha=1, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso.coef_, 's', label='lasso alpha=1')
    lasso01 = Lasso(alpha=0.1, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso01.coef_, '^', label='lasso alpha=0.1')
    lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso00001.coef_, 'v', label='lasso alpha=0.0001')
    ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
    plt.plot(ridge01.coef_, 'o', label='ridge alpha=0.1')
    plt.legend(ncol=2, loc=(0, 1.05))
    plt.ylim(-1000, 750)
    plt.xlabel('Coefficient index')
    plt.ylabel('Coefficient magnitude')
    plt.show()

The plot shows:

  • alpha=1: most coefficients sit at or near 0.
  • alpha=0.1: most coefficients are still near 0, but fewer than at alpha=1, and several are nonzero.
  • alpha=0.0001: the model is barely regularized and most coefficients are nonzero.
  • The alpha=0.1 ridge model looks much like the lasso models.

If the data has many features and only a small subset of them really matters, use lasso regression; otherwise use ridge regression.
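This rule of thumb can be illustrated on synthetic data where only a few features truly matter (a sketch of my own, not from the post; the dataset parameters are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, but only 5 of them carry signal
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10, random_state=3)

lasso = Lasso(alpha=1.0, max_iter=100000).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

lasso_used = int(np.sum(lasso.coef_ != 0))
ridge_used = int(np.sum(ridge.coef_ != 0))
# lasso zeroes out most irrelevant features; ridge keeps all 50 small but nonzero
print("lasso keeps", lasso_used, "features; ridge keeps", ridge_used)
```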

6. Testing All Linear Models on the sklearn Datasets

Create the file machinelearn_data_model.py.

# coding:utf-8

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR
import statsmodels.api as sm


class data_for_model:
    @staticmethod
    def machine_learn(data, model):
        # select the dataset
        if data == "iris":
            mydata = datasets.load_iris()
        elif data == "wine":
            mydata = datasets.load_wine()
        elif data == "breast_cancer":
            mydata = datasets.load_breast_cancer()
        elif data == "diabetes":
            mydata = datasets.load_diabetes()
        elif data == "boston":
            mydata = datasets.load_boston()
        elif data != "two_moon":
            return "Invalid data; choose one of: iris, wine, breast_cancer, diabetes, boston, two_moon"
        if data == "two_moon":
            X, y = datasets.make_moons(n_samples=200, noise=0.05, random_state=0)
        elif model in ("DecisionTreeClassifier", "RandomForestClassifier"):
            X, y = mydata.data[:, :2], mydata.target
        else:
            X, y = mydata.data, mydata.target
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        # select and fit the model
        if model == "LinearRegression":
            md = LinearRegression().fit(X_train, y_train)
        elif model == "LogisticRegression":
            if data == "boston":
                y_train = y_train.astype('int')
                y_test = y_test.astype('int')
            md = LogisticRegression().fit(X_train, y_train)
        elif model == "Ridge":
            md = Ridge(alpha=0.1).fit(X_train, y_train)
        elif model == "Lasso":
            md = Lasso(alpha=0.0001, max_iter=10000000).fit(X_train, y_train)
        elif model == "SVM":
            md = LinearSVR(C=2).fit(X_train, y_train)
        elif model == "sm":
            md = sm.OLS(y, X).fit()
        else:
            return "Invalid model; choose one of: LinearRegression, LogisticRegression, Ridge, Lasso, SVM, sm"
        # report the scores
        if model == "sm":
            print("results.params:\n", md.params,
                  "\nresults.summary():\n", md.summary())
        else:
            print("model:", model, "data:", data,
                  "train score: {:.2%}".format(md.score(X_train, y_train)))
            print("model:", model, "data:", data,
                  "test score: {:.2%}".format(md.score(X_test, y_test)))

Notes on this code:

  • On the Boston housing data, LogisticRegression requires the target y to be of int type, hence the conversion;
  • Ridge uses an alpha of 0.1;
  • Lasso uses an alpha of 0.0001 with a maximum of 10,000,000 iterations.

This lets us quantitatively analyze any supported model on any supported dataset.

from machinelearn_data_model import data_for_model

def linear_for_all_data_and_model():
    datas = ["iris", "wine", "breast_cancer", "diabetes", "boston", "two_moon"]
    models = ["LinearRegression", "LogisticRegression", "Ridge", "Lasso", "SVM", "sm"]
    for data in datas:
        for model in models:
            data_for_model.machine_learn(data, model)

Comparing the test results:

Dataset          Model                Train    Test
Iris             LinearRegression     92.7%    93.7%
Iris             LogisticRegression   96.4%    100%
Iris             Ridge                93.1%    92.8%
Iris             Lasso                92.8%    93.1%
Iris             StatsModels OLS      0.972

The iris data performs well under every model.

Wine             LinearRegression     90.6%    85.1%
Wine             LogisticRegression   97.7%    95.6%
Wine             Ridge                90.2%    86.8%
Wine             Lasso                91.0%    85.2%
Wine             StatsModels OLS      0.948

The wine data also performs well under every model, though slightly worse than iris.

Breast cancer    LinearRegression     79.1%    68.9%
Breast cancer    LogisticRegression   95.3%    93.0%
Breast cancer    Ridge                75.7%    74.5%
Breast cancer    Lasso                77.6%    71.4%
Breast cancer    StatsModels OLS      0.908

The breast cancer data performs well only under logistic regression and OLS.

Diabetes         LinearRegression     52.5%    47.9%
Diabetes         LogisticRegression   2.7%     0.0%
Diabetes         Ridge                51.5%    49.2%
Diabetes         Lasso                51.5%    50.2%
Diabetes         StatsModels OLS      0.106

The diabetes data performs poorly under every model.

Boston housing   LinearRegression     74.5%    70.9%
Boston housing   LogisticRegression   20.8%    11.0%
Boston housing   Ridge                76.0%    62.7%
Boston housing   Lasso                73.5%    74.5%
Boston housing   StatsModels OLS      0.959

The Boston housing data performs well only under OLS; under the other models it does poorly, although under logistic regression it still does somewhat better than the diabetes data.

Two moons        LinearRegression     66.9%    63.0%
Two moons        LogisticRegression   89.3%    86.0%
Two moons        Ridge                66.3%    64.3%
Two moons        Lasso                65.3%    65.2%
Two moons        StatsModels OLS      0.501

The two-moons data performs best under LogisticRegression; under the other models it does not perform well.

(The StatsModels OLS rows report a single fit statistic from the OLS summary rather than train/test scores. The original post closed this section with a color-coded summary table, green for good, red for poor, purple for average, of the six datasets against the five models; the cell colors did not survive extraction, but the per-dataset notes above carry the same conclusions.)

Finally, we bundle in the KNN algorithm; machinelearn_data_model.py is modified as follows:

…
elif model == "KNeighborsClassifier":
    if data == "boston":
        y_train = y_train.astype('int')
        y_test = y_test.astype('int')
    md = KNeighborsClassifier().fit(X_train, y_train)
else:
    return "Invalid model; choose one of: LinearRegression, LogisticRegression, Ridge, Lasso, SVM, sm, KNeighborsClassifier"
…

Call the test program:

from machinelearn_data_model import data_for_model

def KNN_for_all_data_and_model():
    datas = ["iris", "wine", "breast_cancer", "diabetes", "boston", "two_moon"]
    models = ["KNeighborsClassifier"]
    for data in datas:
        for model in models:
            data_for_model.machine_learn(data, model)

Test results:

Dataset          Model                  Train     Test
Iris             KNeighborsClassifier   95.5%     97.4%
Wine             KNeighborsClassifier   76.7%     68.9%
Breast cancer    KNeighborsClassifier   95.3%     90.2%
Diabetes         KNeighborsClassifier   19.6%     0.0%
Boston housing   KNeighborsClassifier   36.4%     4.7%
Two moons        KNeighborsClassifier   100.00%   100.00%

As the table shows, KNeighborsClassifier is effective on the iris, breast cancer, and two-moons data.

(The original post again showed a color-coded summary table of KNeighborsClassifier across the six datasets; the colors did not survive extraction, and the sentence above carries the same conclusion.)
