Loading the Data
```python
import pandas as pd

data = pd.read_csv('housing.csv')
# Features: every column except the MEDV target; target: the MEDV column.
x = data.loc[:, data.columns != 'MEDV']
y = data.loc[:, data.columns == 'MEDV']
```
Overall, the dataset contains 506 records of housing prices in the Boston area of the United States; each record consists of 13 numerical features describing a house together with the target price. The data contain no missing attribute/feature values, which simplifies the subsequent analysis.
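The two claims above (506 records, no missing attribute values) are easy to verify directly with pandas. A minimal sketch of the check, using a small stand-in frame because housing.csv itself is assumed but not bundled here (in the real analysis, `data` would come from `pd.read_csv('housing.csv')`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for housing.csv: numeric feature columns plus MEDV.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(5, 3)), columns=['CRIM', 'RM', 'MEDV'])

n_samples, n_columns = data.shape          # number of rows and columns
missing = int(data.isnull().sum().sum())   # total count of missing cells
print(n_samples, n_columns, missing)
```

On the real file, `data.shape` should report 506 rows and 14 columns (13 features plus MEDV), and the missing-cell count should be 0.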
Data Preparation
```python
# Import the data splitter from sklearn.model_selection
# (sklearn.cross_validation was deprecated in 0.18 and removed in 0.20).
from sklearn.model_selection import train_test_split
import numpy as np

# Any fixed seed works for reproducibility; 33 is used here.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=33, test_size=0.25)

# Inspect the spread of the regression target.
print("The max target value is", np.max(y))
print("The min target value is", np.min(y))
print("The average target value is", np.mean(y))
```
```
The max target value is MEDV    50.0
dtype: float64
The min target value is MEDV    5.0
dtype: float64
The average target value is MEDV    22.532806
dtype: float64
```
The target prices span a wide range, so both the features and the target values are standardized before modeling.
```python
# Import the standardization module from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler

# Separate scalers for the features and for the target values.
ss_x = StandardScaler()
ss_y = StandardScaler()

# Standardize the features and targets of both the training and test sets.
x_train = ss_x.fit_transform(x_train)
x_test = ss_x.transform(x_test)
y_train = ss_y.fit_transform(y_train)
y_test = ss_y.transform(y_test)
```
Linear Regression Models
Two of the simplest linear models, LinearRegression and SGDRegressor, are trained on the Boston housing data and used for prediction.
```python
# Import LinearRegression from sklearn.linear_model.
from sklearn.linear_model import LinearRegression

# Initialize LinearRegression with the default configuration.
lr = LinearRegression()
# Estimate the parameters on the training data.
lr.fit(x_train, y_train)
# Predict on the test data.
lr_y_predict = lr.predict(x_test)

# Import SGDRegressor from sklearn.linear_model.
from sklearn.linear_model import SGDRegressor

# Initialize SGDRegressor with the default configuration.
sgdr = SGDRegressor()
# Estimate the parameters on the training data; SGDRegressor expects a 1-d target.
sgdr.fit(x_train, y_train.ravel())
# Predict on the test data.
sgdr_y_predict = sgdr.predict(x_test)
```
Performance Evaluation
Unlike category prediction, a regression prediction cannot be expected to match the true value exactly; what we usually want is a measure of the gap between predicted and true values, and several evaluation functions exist for this. The most intuitive metrics are the mean absolute error (MAE) and the mean squared error (MSE), which are also the objectives that linear regression models optimize.
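These metrics can be written out directly: MAE = mean(|y - ŷ|), MSE = mean((y - ŷ)²), and R² = 1 - MSE / Var(y). A minimal numpy sketch on toy values, checked against sklearn.metrics:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)    # mean squared error
# R-squared: 1 minus the MSE normalized by the variance of the true targets.
r2 = 1 - mse / np.mean((y_true - np.mean(y_true)) ** 2)

# The hand-written values agree with sklearn's implementations.
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(mae, mse, r2)
```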
```python
# Use LinearRegression's built-in scoring method.
print('The value of default measurement of LinearRegression is', lr.score(x_test, y_test))

# Import r2_score, mean_squared_error and mean_absolute_error from sklearn.metrics.
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Evaluate with r2_score.
print('The value of R-squared of LinearRegression is', r2_score(y_test, lr_y_predict))

# Evaluate with mean_squared_error on the original (un-standardized) scale.
print('The mean squared error of LinearRegression is',
      mean_squared_error(ss_y.inverse_transform(y_test),
                         ss_y.inverse_transform(lr_y_predict)))

# Evaluate with mean_absolute_error on the original scale.
print('The mean absolute error of LinearRegression is',
      mean_absolute_error(ss_y.inverse_transform(y_test),
                          ss_y.inverse_transform(lr_y_predict)))
```
```
The value of default measurement of LinearRegression is 0.6757955014529485
The value of R-squared of LinearRegression is 0.6757955014529485
The mean squared error of LinearRegression is 25.139236520353418
The mean absolute error of LinearRegression is 3.532532543705395
```
```python
# Use SGDRegressor's built-in scoring method.
print('The value of default measurement of SGDRegressor is',
      sgdr.score(x_test, y_test))

# Evaluate with r2_score.
print('The value of R-squared of SGDRegressor is',
      r2_score(y_test, sgdr_y_predict))

# Evaluate with mean_squared_error on the original scale.
print('The mean squared error of SGDRegressor is',
      mean_squared_error(ss_y.inverse_transform(y_test),
                         ss_y.inverse_transform(sgdr_y_predict.reshape(-1, 1))))

# Evaluate with mean_absolute_error on the original scale.
print('The mean absolute error of SGDRegressor is',
      mean_absolute_error(ss_y.inverse_transform(y_test),
                          ss_y.inverse_transform(sgdr_y_predict.reshape(-1, 1))))
```
```
The value of default measurement of SGDRegressor is 0.6562133712412219
The value of R-squared of SGDRegressor is 0.6562133712412219
The mean squared error of SGDRegressor is 26.657660247263877
The mean absolute error of SGDRegressor is 3.5141581908907273
```
- SGDRegressor, which estimates parameters by stochastic gradient descent, performs slightly worse here than LinearRegression, which uses an analytical solution. On tasks with very large training sets, however, stochastic gradient methods are highly efficient for both classification and regression, saving a great deal of computation time without sacrificing much performance.
- Characteristics: linear regressors are the simplest and easiest-to-use regression models. Precisely because of their linearity assumption between features and target, their applicability is somewhat limited: in most real-world data, the relationship between individual features and the regression target is not strictly linear. Even so, when the relationships among features are unknown, a linear regression model still serves as a baseline system for most scientific experiments.
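The linearity limitation described above can be seen concretely. One common remedy (not used in this chapter; a hedged illustration on synthetic data) is to expand the features before fitting the very same LinearRegression, for example with scikit-learn's PolynomialFeatures:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# A clearly nonlinear target: y = x^2 plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = x.ravel() ** 2 + 0.1 * rng.normal(size=100)

# A plain linear model cannot explain a symmetric quadratic target...
linear = LinearRegression().fit(x, y)
# ...but the same model on degree-2 expanded features fits it almost perfectly.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

print(round(linear.score(x, y), 3), round(poly.score(x, y), 3))
```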
Support Vector Machine (Regression) Models
```python
# Import the support vector machine regressor from sklearn.svm.
from sklearn.svm import SVR

# Train an SVR with a linear kernel and predict on the test samples.
linear_svr = SVR(kernel='linear')
linear_svr.fit(x_train, y_train.ravel())
linear_svr_y_predict = linear_svr.predict(x_test)

# Train an SVR with a polynomial kernel and predict on the test samples.
poly_svr = SVR(kernel='poly')
poly_svr.fit(x_train, y_train.ravel())
poly_svr_y_predict = poly_svr.predict(x_test)

# Train an SVR with a radial basis function (RBF) kernel and predict on the test samples.
rbf_svr = SVR(kernel='rbf')
rbf_svr.fit(x_train, y_train.ravel())
rbf_svr_y_predict = rbf_svr.predict(x_test)
```
Performance Evaluation
Evaluating the support vector regressors with different kernels on the same test set, the three sets of scores show very large performance differences between configurations. After mapping the features nonlinearly with the radial basis function (RBF) kernel, the support vector machine achieves the best regression performance.
```python
# Evaluate the three SVR configurations on the same test set with R-squared, MSE and MAE.
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

print('R-squared value of linear SVR is', linear_svr.score(x_test, y_test))
print('The mean squared error of linear SVR is',
      mean_squared_error(ss_y.inverse_transform(y_test),
                         ss_y.inverse_transform(linear_svr_y_predict.reshape(-1, 1))))
print('The mean absolute error of linear SVR is',
      mean_absolute_error(ss_y.inverse_transform(y_test),
                          ss_y.inverse_transform(linear_svr_y_predict.reshape(-1, 1))))
```
```
R-squared value of linear SVR is 0.6506595464215432
The mean squared error of linear SVR is 27.088311013555618
The mean absolute error of linear SVR is 3.432801387759937
```
```python
print('R-squared value of Poly SVR is', poly_svr.score(x_test, y_test))
print('The mean squared error of Poly SVR is',
      mean_squared_error(ss_y.inverse_transform(y_test),
                         ss_y.inverse_transform(poly_svr_y_predict.reshape(-1, 1))))
print('The mean absolute error of Poly SVR is',
      mean_absolute_error(ss_y.inverse_transform(y_test),
                          ss_y.inverse_transform(poly_svr_y_predict.reshape(-1, 1))))
```
```
R-squared value of Poly SVR is 0.4036506510254898
The mean squared error of Poly SVR is 46.24170053104075
The mean absolute error of Poly SVR is 3.738407371046593
```
```python
print('R-squared value of RBF SVR is', rbf_svr.score(x_test, y_test))
print('The mean squared error of RBF SVR is',
      mean_squared_error(ss_y.inverse_transform(y_test),
                         ss_y.inverse_transform(rbf_svr_y_predict.reshape(-1, 1))))
print('The mean absolute error of RBF SVR is',
      mean_absolute_error(ss_y.inverse_transform(y_test),
                          ss_y.inverse_transform(rbf_svr_y_predict.reshape(-1, 1))))
```
```
R-squared value of RBF SVR is 0.7559887416340954
The mean squared error of RBF SVR is 18.920948861538655
The mean absolute error of RBF SVR is 2.6067819999501105
```
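The gaps between kernels above suggest that kernel choice and regularization strength are worth tuning together rather than fixed by hand. A hedged sketch (not part of the chapter; synthetic stand-in data in place of x_train/y_train) using GridSearchCV:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression data with one informative, nonlinear feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Search the kernel and the regularization strength C jointly with 5-fold CV.
param_grid = {'kernel': ['linear', 'poly', 'rbf'], 'C': [0.1, 1.0, 10.0]}
search = GridSearchCV(SVR(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```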
K-Nearest Neighbors (Regression) Models
K-nearest neighbors regression, like K-nearest neighbors classification, is a nonparametric model and involves no parameter-training process.
Two differently configured K-nearest neighbors regressors are applied to the Boston housing data.
```python
# Import KNeighborsRegressor from sklearn.neighbors.
from sklearn.neighbors import KNeighborsRegressor

# Configure the regressor to predict with a plain average: weights='uniform'.
uni_knr = KNeighborsRegressor(weights='uniform')
uni_knr.fit(x_train, y_train.ravel())
uni_knr_y_predict = uni_knr.predict(x_test)

# Configure the regressor to predict with distance-weighted averaging: weights='distance'.
dis_knr = KNeighborsRegressor(weights='distance')
dis_knr.fit(x_train, y_train.ravel())
dis_knr_y_predict = dis_knr.predict(x_test)
```
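Besides `weights`, the `n_neighbors` parameter (default 5) controls how strongly the prediction is smoothed; it is left at its default above. A hedged sketch on synthetic data showing its effect on the training score (k=1 simply memorizes the training set):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# One-dimensional noisy sine data as a stand-in for the housing features.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X.ravel()) + 0.1 * rng.normal(size=300)

scores = {}
for k in (1, 5, 50):
    knr = KNeighborsRegressor(n_neighbors=k, weights='uniform').fit(X, y)
    scores[k] = knr.score(X, y)   # training-set R-squared
    print(k, round(scores[k], 3))
```

With k=1 each training point is its own nearest neighbor, so the training R² is 1.0; large k averages over a wide window and smooths the prediction.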
Performance Evaluation
Evaluating the K-nearest neighbors models under the two prediction configurations, the output shows that distance-weighted averaging predicts house prices better than a plain average.
The two configurations are evaluated on the Boston housing test data below.
```python
# Evaluate the uniform-weighted K-nearest neighbors regressor with R-squared, MSE and MAE.
print('R-squared value of uniform-weighted KNeighborsRegressor:',
      uni_knr.score(x_test, y_test))
print('The mean squared error of uniform-weighted KNeighborsRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test),
                         ss_y.inverse_transform(uni_knr_y_predict.reshape(-1, 1))))
print('The mean absolute error of uniform-weighted KNeighborsRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test),
                          ss_y.inverse_transform(uni_knr_y_predict.reshape(-1, 1))))
```
```
R-squared value of uniform-weighted KNeighborsRegressor: 0.6907212176346006
The mean squared error of uniform-weighted KNeighborsRegressor: 23.981877165354337
The mean absolute error of uniform-weighted KNeighborsRegressor: 2.9650393700787396
```
```python
# Evaluate the distance-weighted K-nearest neighbors regressor with R-squared, MSE and MAE.
print('R-squared value of distance-weighted KNeighborsRegressor:',
      dis_knr.score(x_test, y_test))
print('The mean squared error of distance-weighted KNeighborsRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test),
                         ss_y.inverse_transform(dis_knr_y_predict.reshape(-1, 1))))
print('The mean absolute error of distance-weighted KNeighborsRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test),
                          ss_y.inverse_transform(dis_knr_y_predict.reshape(-1, 1))))
```
```
R-squared value of distance-weighted KNeighborsRegressor: 0.7201094821421604
The mean squared error of distance-weighted KNeighborsRegressor: 21.70307309049035
The mean absolute error of distance-weighted KNeighborsRegressor: 2.8011255022108754
```
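The distance-weighted prediction above is just a weighted average of the neighbors' targets, ŷ = Σᵢ wᵢyᵢ / Σᵢ wᵢ with wᵢ = 1/dᵢ (the inverse of each neighbor's distance). A minimal sketch on a tiny hand-made dataset, checked against KNeighborsRegressor:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [3.0]])
y = np.array([10.0, 20.0, 40.0])

knr = KNeighborsRegressor(n_neighbors=3, weights='distance').fit(X, y)

# Query point at x = 2: distances to the three neighbors are 2, 1 and 1.
d = np.abs(X.ravel() - 2.0)
w = 1.0 / d                           # inverse-distance weights
manual = np.sum(w * y) / np.sum(w)    # weighted average of the neighbor targets

assert np.isclose(manual, knr.predict([[2.0]])[0])
print(manual)
```

Here the closer neighbors (y = 20 and y = 40) dominate, giving a prediction of 26 rather than the plain average of 23.33.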
Regression Tree Models
```python
# Import DecisionTreeRegressor from sklearn.tree.
from sklearn.tree import DecisionTreeRegressor

# Initialize DecisionTreeRegressor with the default configuration.
dtr = DecisionTreeRegressor()
# Build the regression tree on the Boston housing training data.
dtr.fit(x_train, y_train.ravel())
# Predict on the test data and store the result in dtr_y_predict.
dtr_y_predict = dtr.predict(x_test)
```
Performance Evaluation
Evaluating the default regression tree on the test set, its scores beat those of the linear regressors LinearRegression and SGDRegressor. This suggests that the features and the target of the Boston housing problem are related in a somewhat nonlinear way.
```python
# Evaluate the default regression tree with R-squared, MSE and MAE.
print('R-squared value of DecisionTreeRegressor:', dtr.score(x_test, y_test))
print('The mean squared error of DecisionTreeRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test),
                         ss_y.inverse_transform(dtr_y_predict.reshape(-1, 1))))
print('The mean absolute error of DecisionTreeRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test),
                          ss_y.inverse_transform(dtr_y_predict.reshape(-1, 1))))
```
```
R-squared value of DecisionTreeRegressor: 0.7022130268545288
The mean squared error of DecisionTreeRegressor: 23.0907874015748
The mean absolute error of DecisionTreeRegressor: 3.0937007874015747
```
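A caveat about the default configuration used above: an unconstrained DecisionTreeRegressor keeps splitting until its leaves are (nearly) pure, so it memorizes the training data. A hedged sketch on synthetic data showing how limiting `max_depth` shrinks the train/test gap:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy one-dimensional sine data as a stand-in for the housing features.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X.ravel()) + 0.3 * rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for depth in (None, 4):   # unconstrained tree vs a depth-limited one
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, [round(s, 3) for s in results[depth]])
```

The unconstrained tree reaches a training R² of 1.0 while its test R² lags well behind; the depth-4 tree trades a little training fit for a much smaller gap.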
Strengths and Weaknesses of Tree Models
- Strengths: (1) tree models can handle nonlinear features; (2) they do not require feature standardization or a unified scale, so both numerical and categorical features can be used directly when building the tree and making predictions; (3) for the same reasons, tree models expose their decision process directly, which makes their predictions interpretable.
- Weaknesses: (1) precisely because trees can fit complex nonlinear problems, they easily grow over-complex and lose accuracy on new data (generalization ability); (2) the top-down prediction flow can change structurally quite drastically in response to small changes in the data, so predictions are unstable; (3) building the optimal tree from training data is NP-hard, i.e. no optimal solution can be found in limited time, so the greedy-style algorithms in use only find sub-optimal trees. This is why ensemble models, which combine many sub-optimal trees, are often used to reach higher performance.
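The instability point (2) and the ensemble remedy can both be seen in one experiment. A hedged sketch (synthetic data, not the housing set): refit a single tree and a random forest on bootstrap resamples of the same data and measure how much their predictions vary from fit to fit:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Noisy one-dimensional sine data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X.ravel()) + 0.3 * rng.normal(size=200)
X_query = np.linspace(-3, 3, 100).reshape(-1, 1)

def spread(model):
    # Fit the model on 10 bootstrap resamples and average the pointwise
    # standard deviation of its predictions across the refits.
    preds = []
    for seed in range(10):
        idx = np.random.default_rng(seed).integers(0, len(X), len(X))
        preds.append(model.fit(X[idx], y[idx]).predict(X_query))
    return float(np.mean(np.std(preds, axis=0)))

tree_spread = spread(DecisionTreeRegressor(random_state=0))
forest_spread = spread(RandomForestRegressor(n_estimators=50, random_state=0))
print(round(tree_spread, 3), round(forest_spread, 3))
```

The single tree's predictions swing far more across resamples than the forest's, which is exactly the variance reduction that averaging many sub-optimal trees buys.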
Ensemble Regression Models
Three ensemble regressors, RandomForestRegressor, ExtraTreesRegressor and GradientBoostingRegressor, are applied to the Boston housing data.
```python
# Import RandomForestRegressor, ExtraTreesRegressor and GradientBoostingRegressor
# from sklearn.ensemble.
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor

# Train a RandomForestRegressor and store its test-set predictions in rfr_y_predict.
rfr = RandomForestRegressor()
rfr.fit(x_train, y_train.ravel())
rfr_y_predict = rfr.predict(x_test)

# Train an ExtraTreesRegressor and store its test-set predictions in etr_y_predict.
etr = ExtraTreesRegressor()
etr.fit(x_train, y_train.ravel())
etr_y_predict = etr.predict(x_test)

# Train a GradientBoostingRegressor and store its test-set predictions in gbr_y_predict.
gbr = GradientBoostingRegressor()
gbr.fit(x_train, y_train.ravel())
gbr_y_predict = gbr.predict(x_test)
```
Performance Evaluation
The regression performance of the three ensemble models is evaluated on the Boston housing test data.
```python
# Evaluate the default random forest regressor with R-squared, MSE and MAE.
print('R-squared value of RandomForestRegressor:', rfr.score(x_test, y_test))
print('The mean squared error of RandomForestRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test),
                         ss_y.inverse_transform(rfr_y_predict.reshape(-1, 1))))
print('The mean absolute error of RandomForestRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test),
                          ss_y.inverse_transform(rfr_y_predict.reshape(-1, 1))))
```
```
R-squared value of RandomForestRegressor: 0.8313099192129644
The mean squared error of RandomForestRegressor: 13.080447244094488
The mean absolute error of RandomForestRegressor: 2.4341732283464568
```
```python
# Collect the feature names (the columns of x, i.e. excluding the MEDV target).
features = x.columns.tolist()
```
```python
# Evaluate the default extra-trees regressor with R-squared, MSE and MAE.
print('R-squared value of ExtraTreesRegressor:', etr.score(x_test, y_test))
print('The mean squared error of ExtraTreesRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test),
                         ss_y.inverse_transform(etr_y_predict.reshape(-1, 1))))
print('The mean absolute error of ExtraTreesRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test),
                          ss_y.inverse_transform(etr_y_predict.reshape(-1, 1))))

# Print each feature's contribution to the trained extra-trees model,
# sorted by importance in descending order (np.sort on the pairs would not do this).
print(sorted(zip(etr.feature_importances_, features), reverse=True))
```
```
R-squared value of ExtraTreesRegressor: 0.786673179902566
The mean squared error of ExtraTreesRegressor: 16.541637795275587
The mean absolute error of ExtraTreesRegressor: 2.482362204724409
[(0.39223064060531565, 'RM'), (0.33401401559024363, 'LSTAT'),
 (0.0548199923017705, 'PTRATIO'), (0.037963996091105554, 'INDUS'),
 (0.03016119580457789, 'TAX'), (0.02829992532927602, 'CHAS'),
 (0.026481752440140605, 'NOX'), (0.02397108032077896, 'CRIM'),
 (0.02149362205441516, 'DIS'), (0.016796320061587565, 'B'),
 (0.015445008612824782, 'RAD'), (0.014899541666303811, 'AGE'),
 (0.0034229091216599035, 'ZN')]
```
```python
# Evaluate the default gradient boosting regressor with R-squared, MSE and MAE.
print('R-squared value of GradientBoostingRegressor:', gbr.score(x_test, y_test))
print('The mean squared error of GradientBoostingRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test),
                         ss_y.inverse_transform(gbr_y_predict.reshape(-1, 1))))
print('The mean absolute error of GradientBoostingRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test),
                          ss_y.inverse_transform(gbr_y_predict.reshape(-1, 1))))
```
```
R-squared value of GradientBoostingRegressor: 0.8351291631782853
The mean squared error of GradientBoostingRegressor: 12.784298122773137
The mean absolute error of GradientBoostingRegressor: 2.303546619059316
```