Introductory machine-learning datasets, part 2: Boston house prices

2019-03-04 10:21:16

scikit-learn ships a small house-price dataset with 13 feature dimensions; the dataset used here instead has 79 features. This article fits two models, linear regression and a random forest, and evaluates them with two metrics, R² and MSE. The experiments show the random forest performs better, for two reasons: bagging gives random forests stronger resistance to overfitting, and with this many house-price features a tree-based model can capture the data better.

A look at the data

The Boston house-price dataset comes with scikit-learn, already preprocessed.

import numpy as np
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2

np.set_printoptions(suppress=True)  # print floats without scientific notation
boston = load_boston()

print("data shape:{}".format(boston.data.shape))
print("target shape:{}".format(boston.target.shape))
print("line head 5:\n{}".format(boston.data[:5]))
print("target head 5:\n{}".format(boston.target[:5]))

Output:

data shape:(506, 13)
target shape:(506,)
line head 5:
[[  0.00632  18.        2.31      0.        0.538     6.575    65.2
    4.09      1.      296.       15.3     396.9       4.98   ]
 [  0.02731   0.        7.07      0.        0.469     6.421    78.9
    4.9671    2.      242.       17.8     396.9       9.14   ]
 [  0.02729   0.        7.07      0.        0.469     7.185    61.1
    4.9671    2.      242.       17.8     392.83      4.03   ]
 [  0.03237   0.        2.18      0.        0.458     6.998    45.8
    6.0622    3.      222.       18.7     394.63      2.94   ]
 [  0.06905   0.        2.18      0.        0.458     7.147    54.2
    6.0622    3.      222.       18.7     396.9       5.33   ]]
target head 5:
[24.  21.6 34.7 33.4 36.2]

This dataset can be handled with just about any simple model; see, for example: https://www.jianshu.com/p/f828eae005a1?utm_campaign=haruki&utm_content=note&utm_medium=reader_share&utm_source=weixin_timeline&from=timeline

There is also a second dataset in csv format with 80 columns (79 features plus the SalePrice target). The rest of this article works with that file.

Boston house-price dataset

Data preprocessing

Loading the data
import numpy as np
import pandas as pd

# read_csv loads the csv file; index_col=0 makes the first column (Id) the index
train_df = pd.read_csv("/Users/wangsen/ai/03/9day_discuz/firstDiscuz/02_houseprice/data/train.csv", index_col=0)
test_df = pd.read_csv("/Users/wangsen/ai/03/9day_discuz/firstDiscuz/02_houseprice/data/test.csv", index_col=0)

print(train_df.info())
## print(train_df.describe().T)
print(train_df['MSSubClass'].value_counts())  # value_counts: frequency of each value
print(train_df['MSSubClass'].unique())        # unique: the distinct values

# Preprocess the training and test sets together so both get the same columns.
# The target only exists in the training set, so pop it off before concatenating.
train_target = train_df.pop('SalePrice')
train_len = len(train_df)

all_df = pd.concat((train_df, test_df), axis=0)
print(all_df.shape)
print(all_df['MSSubClass'].value_counts())
print(all_df['MSSubClass'].unique())

print(pd.concat((all_df['MSSubClass'][:5],
                 pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass')[:5]),
                axis=1).T)
  • df.info() shows the number of columns plus each column's dtype and non-null count:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(34), object(43)
memory usage: 923.9  KB
  • pd.get_dummies one-hot encodes (dummy-encodes) a categorical feature. pandas' encoder has no fit/transform step, so the training and test sets are concatenated first to guarantee both get identical columns. Take the first column, MSSubClass, as an example: inspect its value distribution with unique() or value_counts(), then encode it with pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass'). The distinct values are sorted before encoding, so the smallest value, 20, becomes [1,0,0,...] and the next, 30, becomes [0,1,0,...]. The result:
Id               1   2   3   4   5
MSSubClass      60  20  60  70  60
MSSubClass_20    0   1   0   0   0
MSSubClass_30    0   0   0   0   0
MSSubClass_40    0   0   0   0   0
MSSubClass_45    0   0   0   0   0
MSSubClass_50    0   0   0   0   0
MSSubClass_60    1   0   1   0   1
MSSubClass_70    0   0   0   1   0
MSSubClass_75    0   0   0   0   0
MSSubClass_80    0   0   0   0   0
MSSubClass_85    0   0   0   0   0
MSSubClass_90    0   0   0   0   0
MSSubClass_120   0   0   0   0   0
MSSubClass_150   0   0   0   0   0
MSSubClass_160   0   0   0   0   0
MSSubClass_180   0   0   0   0   0
MSSubClass_190   0   0   0   0   0
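The reason for concatenating before encoding can be sketched with a pair of hypothetical toy frames (not the article's data): dummies built per split produce mismatched columns, while dummies built on the concatenated frame share one column set.

```python
import pandas as pd

# Hypothetical toy split: the test rows contain a category (90) that the
# training rows never see.
train = pd.DataFrame({"MSSubClass": [60, 20, 60]})
test = pd.DataFrame({"MSSubClass": [20, 90]})

# Encoding each split separately yields different column sets.
train_d = pd.get_dummies(train["MSSubClass"], prefix="MSSubClass")
test_d = pd.get_dummies(test["MSSubClass"], prefix="MSSubClass")
print(sorted(train_d.columns))  # no MSSubClass_90 column here
print(sorted(test_d.columns))   # no MSSubClass_60 column here

# Concatenating first, as the article does, gives one shared column set.
all_d = pd.get_dummies(pd.concat((train, test), axis=0)["MSSubClass"],
                       prefix="MSSubClass")
print(sorted(all_d.columns))
```

A model trained on train_d could not consume test_d directly; concatenating first avoids exactly that mismatch.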
  • Check for missing values:

all_dummy_df = pd.get_dummies(all_df)  # one-hot encode every object column
# print(all_dummy_df.head())
print(all_dummy_df.shape)
# NaN count per column, most-missing first
print(all_dummy_df.isnull().sum().sort_values(ascending=False))
all_df shape:(2919, 79)
LotFrontage              486
GarageYrBlt              159
MasVnrArea                23
BsmtFullBath               2
BsmtHalfBath               2
BsmtFinSF1                 1
BsmtFinSF2                 1
BsmtUnfSF                  1
TotalBsmtSF                1
GarageArea                 1
GarageCars                 1
Condition1_RRNe            0
Condition1_RRNn            0
  • Fill missing values with the column mean:

mean_cols = all_dummy_df.mean()                # per-column means
all_dummy_df = all_dummy_df.fillna(mean_cols)  # fill each NaN with its column's mean
  • Model training:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

dummy_train_df = all_dummy_df[:train_len]  # the first train_len rows are the training set

lr = LinearRegression()
lr.fit(dummy_train_df, train_target)  # fit before calling score
print("Training-set score:{}".format(lr.score(dummy_train_df, train_target)))

kf = KFold(n_splits=5, shuffle=True)
score_ndarray = cross_val_score(lr, dummy_train_df, train_target, cv=kf)
print(score_ndarray)
print(score_ndarray.mean())

Output:

Training-set score:0.9332679645484127
[ 0.88669507 -1.54529853  0.90133954  0.84720817  0.86750469  0.92840145
  0.8299786   0.91205312  0.92400129  0.91065317  0.55149449  0.87645062
  0.48737113  0.82570995  0.91949504  0.8890254   0.79646233  0.94457746
  0.65656125  0.91573777]
0.7162711000424071

LinearRegression's score is R². The model reaches 0.93 on the training set, but cross-validation averages only about 0.72, and one fold even drops below zero, so the model is overfitting.
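The negative fold score above deserves a note: R² = 1 − SS_res/SS_tot, so any model whose predictions on a fold are worse than simply predicting the fold's mean gets a negative score. A minimal sketch with made-up numbers (not the house-price data):

```python
import numpy as np

def r2_score_manual(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot; this is what LinearRegression.score reports."""
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])

print(r2_score_manual(y, np.array([1.1, 2.0, 2.9, 4.0])))  # close fit -> near 1
print(r2_score_manual(y, np.array([4.0, 3.0, 2.0, 1.0])))  # worse than the mean -> negative
```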

For R²'s exact definition, see the r2_score implementation in the scikit-learn source on GitHub; cross_val_score computes the score on each held-out fold.

Since R² does not express the magnitude of the error directly, we also compare the two models, linear regression and random forest, by MSE (reported below as RMSE, in the same units as the target).
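The two metrics are tied together by the identity R² = 1 − MSE/Var(y): R² rescales the MSE by the target's variance, while RMSE = √MSE stays in the target's own units, which is why it conveys the error magnitude more directly. A toy check of the identity (made-up numbers, not the house-price data):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.5])

mse = np.mean((y - y_hat) ** 2)   # mean squared error
rmse = np.sqrt(mse)               # same units as y
r2 = 1 - mse / np.var(y)          # np.var is the population variance, SS_tot / n
print(mse, rmse, r2)
```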

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True)

# Cross-validated RMSE; scoring="neg_mean_squared_error" returns negative MSE,
# so negate it before taking the square root.
lr = LinearRegression()
score_ndarray = np.sqrt(-cross_val_score(lr, dummy_train_df, train_target, cv=kf, scoring="neg_mean_squared_error"))
print(score_ndarray.mean())

clf = RandomForestRegressor(n_estimators=200, max_features=3)
score_ndarray = np.sqrt(-cross_val_score(clf, dummy_train_df, train_target, cv=kf, scoring="neg_mean_squared_error"))
print(score_ndarray.mean())

# Training-set RMSE for both models
clf.fit(dummy_train_df, train_target)
print("Random forest RMSE:", np.sqrt(mean_squared_error(train_target, clf.predict(dummy_train_df))))

lr.fit(dummy_train_df, train_target)
print("Linear regression RMSE:", np.sqrt(mean_squared_error(train_target, lr.predict(dummy_train_df))))
Random forest RMSE: 13134.86059929929
Linear regression RMSE: 20514.990603536615

Summary

The random forest beats linear regression on this data: its RMSE (about 13,100) is well below linear regression's (about 20,500). Note, though, that these final two numbers are measured on the training set; the cross-validated RMSE means printed by the code above are the fairer comparison between the two models.
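The conclusion can be illustrated on synthetic data (a made-up nonlinear target, not the house-price table): when features interact nonlinearly, a bagged forest typically cross-validates better than a plain linear fit. This is a sketch under that assumption, not a benchmark.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: a nonlinear interaction term plus a linear term plus noise.
rng = np.random.RandomState(0)
X = rng.rand(300, 10)
y = 10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 2 * X[:, 2] + rng.randn(300) * 0.5

kf = KFold(n_splits=5, shuffle=True, random_state=0)
lr_r2 = cross_val_score(LinearRegression(), X, y, cv=kf).mean()
rf_r2 = cross_val_score(RandomForestRegressor(n_estimators=200, random_state=0),
                        X, y, cv=kf).mean()
print(lr_r2, rf_r2)  # the forest captures the interaction the linear model cannot
```

The gap narrows, or reverses, when the true relationship is close to linear, so the model choice still depends on the data.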
