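sklearn ships with a small house price dataset that has 13 features. The dataset used in this article has 79 features. Two models are applied to the data, linear regression and random forest, and two evaluation metrics are used, R2 and MSE. The experiments show that the random forest performs better. One reason is that the bagging used by random forests resists overfitting; another is that with this many features, decision-tree-based models can achieve better results.
Data overview
The Boston housing dataset: sklearn provides a copy that has already been preprocessed.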
import sklearn
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2, so this requires an older release
from sklearn.datasets import load_boston
np.set_printoptions(suppress=True)
boston = load_boston()
print("data shape:{}".format(boston.data.shape))
print("target shape:{}".format(boston.target.shape))
print("line head 5:\n{}".format(boston.data[:5]))
print("target head 5:\n{}".format(boston.target[:5]))
Output:
data shape:(506, 13)
target shape:(506,)
line head 5:
[[ 0.00632 18. 2.31 0. 0.538 6.575 65.2
4.09 1. 296. 15.3 396.9 4.98 ]
[ 0.02731 0. 7.07 0. 0.469 6.421 78.9
4.9671 2. 242. 17.8 396.9 9.14 ]
[ 0.02729 0. 7.07 0. 0.469 7.185 61.1
4.9671 2. 242. 17.8 392.83 4.03 ]
[ 0.03237 0. 2.18 0. 0.458 6.998 45.8
6.0622 3. 222. 18.7 394.63 2.94 ]
[ 0.06905 0. 2.18 0. 0.458 7.147 54.2
6.0622 3. 222. 18.7 396.9 5.33 ]]
target head 5:
[24. 21.6 34.7 33.4 36.2]
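Any simple model can handle this data; see the article below. https://www.jianshu.com/p/f828eae005a1?utm_campaign=haruki&utm_content=note&utm_medium=reader_share&utm_source=weixin_timeline&from=timeline A second dataset, in CSV format, has 80 columns. The rest of this article works with that data.
Boston housing dataset
Data preprocessing
Loading the data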
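import pandas as pd
## read_csv loads a CSV file
## index_col=0 marks the first column as the id column
train_df = pd.read_csv("/Users/wangsen/ai/03/9day_discuz/firstDiscuz/02_houseprice/data/train.csv", index_col=0)
test_df = pd.read_csv("/Users/wangsen/ai/03/9day_discuz/firstDiscuz/02_houseprice/data/test.csv", index_col=0)
print(train_df.info())
##print(train_df.describe().T)
## value_counts counts how often each value occurs
## unique lists the distinct values
print(train_df['MSSubClass'].value_counts())
print(train_df['MSSubClass'].unique())
# Assumed step: split off the target so that all_df has 79 feature columns;
# train_target and train_len are used by the training code further down
train_target = train_df.pop('SalePrice')
train_len = len(train_df)
# Preprocess the training and test sets together
all_df = pd.concat((train_df, test_df), axis=0)
print(all_df.shape)
print(all_df['MSSubClass'].value_counts())
print(all_df['MSSubClass'].unique())
# One-hot encode a single column (MSSubClass)
print(pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass'))
print(pd.concat((all_df['MSSubClass'][:5], pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass')[:5]), axis=1).T)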
- df.info() shows how many columns there are, plus the non-null count and dtype of each column
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(34), object(43)
memory usage: 923.9 KB
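- pd.get_dummies one-hot encodes discrete features (also called dummy encoding). Because the pandas encoder has no fit/transform steps, the training and test sets have to be concatenated before encoding.
Take the first column, MSSubClass, as an example. Use unique() or value_counts() to inspect the distribution of its values first.
pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass')
When a single column is encoded, its distinct values are sorted first: the smallest value, 20, becomes [1,0,0,...], the second smallest, 30, becomes [0,1,0,...]. The result for the first five rows: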
Id 1 2 3 4 5
MSSubClass 60 20 60 70 60
MSSubClass_20 0 1 0 0 0
MSSubClass_30 0 0 0 0 0
MSSubClass_40 0 0 0 0 0
MSSubClass_45 0 0 0 0 0
MSSubClass_50 0 0 0 0 0
MSSubClass_60 1 0 1 0 1
MSSubClass_70 0 0 0 1 0
MSSubClass_75 0 0 0 0 0
MSSubClass_80 0 0 0 0 0
MSSubClass_85 0 0 0 0 0
MSSubClass_90 0 0 0 0 0
MSSubClass_120 0 0 0 0 0
MSSubClass_150 0 0 0 0 0
MSSubClass_160 0 0 0 0 0
MSSubClass_180 0 0 0 0 0
MSSubClass_190 0 0 0 0 0
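The reason for concatenating before encoding can be seen on a toy example. This is a minimal sketch with hypothetical data (the frames `toy_train` and `toy_test` are not part of the article's dataset): a category that appears only in the test set produces a column that the separately encoded training set lacks.

```python
import pandas as pd

# Hypothetical toy data: the test set contains a category (150) absent from training
toy_train = pd.DataFrame({"MSSubClass": [20, 60, 70]})
toy_test = pd.DataFrame({"MSSubClass": [20, 150]})

# Encoding each set separately yields different column sets
sep_train = pd.get_dummies(toy_train["MSSubClass"], prefix="MSSubClass")
sep_test = pd.get_dummies(toy_test["MSSubClass"], prefix="MSSubClass")
print(list(sep_train.columns))
print(list(sep_test.columns))

# Encoding the concatenated data gives both parts the same columns
toy_all = pd.concat((toy_train, toy_test), axis=0)
toy_dummies = pd.get_dummies(toy_all["MSSubClass"], prefix="MSSubClass")
joint_train = toy_dummies[:len(toy_train)]
joint_test = toy_dummies[len(toy_train):]
print(list(joint_train.columns) == list(joint_test.columns))
```

A model fitted on `sep_train` could not score `sep_test` because the feature columns differ, whereas `joint_train` and `joint_test` share one layout.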
- Checking for missing values
all_dummy_df = pd.get_dummies(all_df)
# print(all_dummy_df.head())
print(all_dummy_df.shape)
print(all_dummy_df.isnull().sum().sort_values(ascending=False))
Output:
all_df shape:(2919, 79)
LotFrontage 486
GarageYrBlt 159
MasVnrArea 23
BsmtFullBath 2
BsmtHalfBath 2
BsmtFinSF1 1
BsmtFinSF2 1
BsmtUnfSF 1
TotalBsmtSF 1
GarageArea 1
GarageCars 1
Condition1_RRNe 0
Condition1_RRNn 0
- Filling missing values with the column mean
mean_cols = all_dummy_df.mean()
all_dummy_df = all_dummy_df.fillna(mean_cols)
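How `fillna` behaves when it is given a Series of column means can be checked on a small frame. The sketch below uses made-up values (the frame `toy` is hypothetical); only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "GarageCars": [2.0, 1.0, np.nan],
})

# DataFrame.mean() skips NaN by default and returns one mean per column
toy_means = toy.mean()
# fillna with a Series fills each column with the value under the matching label
toy_filled = toy.fillna(toy_means)

print(toy_means)
print(toy_filled)
```

Each gap is filled with the mean of its own column, so no column's mean changes. In the article the means are computed over training and test rows together; computing them from the training rows alone is a common alternative when the test set should stay unseen.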
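- Model training
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
# The first train_len rows of the combined frame are the training set
dummy_train_df = all_dummy_df[:train_len]
lr = LinearRegression()
# The model has to be fitted before score() can return R^2
lr.fit(dummy_train_df, train_target)
print("Training set score:{}".format(lr.score(dummy_train_df, train_target)))
kf = KFold(n_splits=5, shuffle=True)
score_ndarray = cross_val_score(lr, dummy_train_df, train_target, cv=kf)
print(score_ndarray)
print(score_ndarray.mean())
Output:
Training set score:0.9332679645484127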
[ 0.88669507 -1.54529853 0.90133954 0.84720817 0.86750469 0.92840145
0.8299786 0.91205312 0.92400129 0.91065317 0.55149449 0.87645062
0.48737113 0.82570995 0.91949504 0.8890254 0.79646233 0.94457746
0.65656125 0.91573777]
0.7162711000424071
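The output above lists one score per fold. As a minimal sketch of that relationship, the following uses synthetic data from `make_regression` (an assumption for illustration, not the house price data) and fixes `random_state` so the run is repeatable.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption for illustration, not the house price data)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
# For a regressor the default scoring is the estimator's score(), i.e. R^2
fold_scores = cross_val_score(LinearRegression(), X, y, cv=kf)

print(len(fold_scores))    # one score per fold
print(fold_scores.mean())
```

The length of the returned array always equals `n_splits`, so the number of printed scores tells you how many folds a run used.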
score
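LinearRegression scores with R^2. The model reaches 0.93 on the training set, but the cross-validation mean is only about 0.72, which suggests overfitting. The output lists 20 fold scores, so that run evidently used more folds than the n_splits=5 shown in the listing; one fold scores -1.55, which alone lowers the mean considerably.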
R2
R2 source code: see GitHub
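For reference, R^2 is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below computes it by hand and compares the result with `sklearn.metrics.r2_score`; the helper `r2_manual` and the sample values are hypothetical, chosen only for illustration. It also shows that R^2 becomes negative when predictions are worse than always predicting the mean, which is how a single fold can score below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared residuals
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared deviations from the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical values, chosen only for illustration
y_true = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.1, 1.9, 3.2, 3.8]
bad_pred = [4.0, 3.0, 2.0, 1.0]   # worse than always predicting the mean

print(r2_manual(y_true, good_pred), r2_score(y_true, good_pred))
print(r2_manual(y_true, bad_pred), r2_score(y_true, bad_pred))
```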
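cross_val_score: cross-validation error
Because R^2 does not directly express the magnitude of the error, the two models, linear regression and random forest, are compared by MSE, reported here as its square root (RMSE).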
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
lr = LinearRegression()
kf = KFold(n_splits=5, shuffle=True)
# Cross-validated RMSE: the scorer returns negative MSE, so negate it before the square root
score_ndarray = np.sqrt(-cross_val_score(lr, dummy_train_df, train_target, cv=kf, scoring="neg_mean_squared_error"))
print(score_ndarray.mean())
clf = RandomForestRegressor(n_estimators=200, max_features=3)
score_ndarray = np.sqrt(-cross_val_score(clf, dummy_train_df, train_target, cv=kf, scoring="neg_mean_squared_error"))
print(score_ndarray.mean())
# RMSE on the training set itself (not a held-out estimate)
clf.fit(dummy_train_df, train_target)
train_predict = clf.predict(dummy_train_df)
print("Random forest error:", np.sqrt(mean_squared_error(train_target, train_predict)))
lr.fit(dummy_train_df, train_target)
train_predict = lr.predict(dummy_train_df)
print("Linear regression error:", np.sqrt(mean_squared_error(train_target, train_predict)))
Output:
Random forest error: 13134.86059929929
Linear regression error: 20514.990603536615
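The two numbers above are measured on the rows the models were fitted on. As a hedged illustration of how far training error and held-out error can differ for a random forest, the sketch below uses synthetic data from `make_regression` (an assumption, not the house price data) with fixed seeds, and default forest settings rather than the article's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration, not the house price data)
X, y = make_regression(n_samples=200, n_features=20, noise=20.0, random_state=0)

rf = RandomForestRegressor(n_estimators=50, random_state=0)

# RMSE on the same rows the model was fitted on
rf.fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, rf.predict(X)))

# RMSE estimated on held-out folds; cross_val_score fits fresh clones of rf
kf = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(rf, X, y, cv=kf, scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse).mean()

print("train RMSE:", train_rmse)
print("cv RMSE:", cv_rmse)
```

A forest fits its own training rows closely, so the training RMSE is the smaller of the two; the cross-validated figure is the one to use when comparing models.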
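Summary
The random forest gives better results than linear regression: its training-set RMSE is 13134.86 versus 20514.99. Both figures are measured on the training set; the cross-validated RMSE printed earlier in the listing is the fairer comparison, but its values are not included in the output above.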