Hi everyone, I'm Peter~
This article is an excellent entry-level data analysis case study built on the classic Titanic dataset (the train split). It covers:
- Exploratory data analysis (EDA)
- Data preprocessing and feature engineering
- Modeling and prediction
- Hyperparameter tuning
- Ensemble learning
- Feature importance ranking
If you need the notebook source code and data, message us directly.
<!--MORE-->
Importing the data
In 1:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline
from dataprep.datasets import load_dataset  # built-in datasets
from dataprep.eda import plot               # plotting
from dataprep.eda import plot_correlation   # correlations
from dataprep.eda import create_report      # full analysis report
from dataprep.eda import plot_missing       # missing values
import warnings
warnings.filterwarnings('ignore')
In 2:
data = pd.read_csv("train.csv")
data.head()
Out2:
Automated exploratory analysis
dataprep's automated exploratory analysis gives a quick overall picture of the data.
In 3:
data.shape  # number of rows and columns
Out3:
(891, 12)
In 4:
data.isnull().sum()  # missing values per column
Out4:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
In 5:
data.dtypes  # column data types
Out5:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
In 6:
plot(data)
In 7:
plot_correlation(data)
In 8:
plot_missing(data)
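By the way, create_report was imported above but never called; as a sketch (assuming a recent dataprep version), it bundles the plots above into one interactive report:

report = create_report(data)  # build a full EDA report
report.show_browser()         # open it in the default browser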
Feature exploration
There are three kinds of features:
- Categorical features
- Ordinal features, e.g. height bucketed as short/medium/tall
- Continuous features
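For reference, one reasonable grouping of the Titanic columns into these three types (a sketch, not from the original notebook):

feature_types = {
    'categorical': ['Sex', 'Embarked'],  # unordered categories
    'ordinal': ['Pclass'],               # ordered: 1st > 2nd > 3rd class
    'continuous': ['Age', 'Fare'],       # real-valued measurements
}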
Target variable: Survived
In 9:
# How many passengers survived?
f,ax=plt.subplots(1,2,figsize=(18,8))
# Plot 1: pie chart
data['Survived'].value_counts().plot.pie(explode=[0,0.1],
                                         autopct='%1.1f%%',
                                         ax=ax[0],
                                         shadow=True)
ax[0].set_title('Survived')
ax[0].set_ylabel('')
# Plot 2: count plot
sns.countplot('Survived',data=data,ax=ax[1])
ax[1].set_title('Survived')
plt.show()
Count survivors by Sex:
In 10:
data.groupby(['Sex','Survived'])['Survived'].count()  # survivors and deaths per sex
Out10:
Sex Survived
female 0 81
1 233
male 0 468
1 109
Name: Survived, dtype: int64
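The counts above translate directly into survival rates; as a quick companion check:

data.groupby('Sex')['Survived'].mean()  # roughly 0.74 for female, 0.19 for male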
Survived vs Sex
In 11:
f,ax=plt.subplots(1,2,figsize=(18,8))  # 1 row, 2 columns; pick a panel via ax
# Grouped mean bar chart with plot.bar
data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survived vs Sex')
# Count bar chart with countplot
sns.countplot('Sex',           # column to count
              hue='Survived',  # split by class
              data=data,       # the DataFrame
              ax=ax[1]         # which subplot to draw into
              )
ax[1].set_title('Sex: Survived vs Dead')
plt.show()
Pclass: Survived vs Dead
In 12:
# Pivot-table-style counts with pandas crosstab
pd.crosstab(data.Pclass,data.Survived,margins=True)
Out12:
| Pclass | Survived = 0 | Survived = 1 | All |
|---|---|---|---|
| 1 | 80 | 136 | 216 |
| 2 | 97 | 87 | 184 |
| 3 | 372 | 119 | 491 |
| All | 549 | 342 | 891 |
In 13:
# Add gradient styling to the table
pd.crosstab(data.Pclass,data.Survived,margins=True).style.background_gradient(cmap='RdYlGn_r')
In 14:
f,ax=plt.subplots(1,2,figsize=(18,8))
# Plot 1: value_counts().plot.bar
data['Pclass'].value_counts().plot.bar(color=['#AD7F32','#EFDF00','#D3D3D3'],ax=ax[0])
ax[0].set_title('Number Of Passengers By Pclass')
ax[0].set_ylabel('Count')
# Plot 2: sns.countplot
sns.countplot('Pclass',hue='Survived',data=data,ax=ax[1])
ax[1].set_title('Pclass:Survived vs Dead')
plt.show()
Survived based on Sex and Pclass
In 15:
pd.crosstab([data.Sex, data.Survived], data.Pclass,margins=True)
Out15:
| Sex | Survived | Pclass 1 | Pclass 2 | Pclass 3 | All |
|---|---|---|---|---|---|
| female | 0 | 3 | 6 | 72 | 81 |
| female | 1 | 91 | 70 | 72 | 233 |
| male | 0 | 77 | 91 | 300 | 468 |
| male | 1 | 45 | 17 | 47 | 109 |
| All | | 216 | 184 | 491 | 891 |
In 16:
pd.crosstab([data.Sex,data.Survived],data.Pclass,margins=True).style.background_gradient(cmap='YlGn_r')
In 17:
fig = plt.figure(figsize=(12,6))
sns.factorplot('Pclass','Survived',hue='Sex',data=data)
plt.show()
Feature: Age
A continuous feature.
In 18:
data['Age'].max()  # maximum
Out18:
80.0
In 19:
data['Age'].min()  # minimum
Out19:
0.42
In 20:
data['Age'].mean()  # mean
Out20:
29.69911764705882
In 21:
f,ax=plt.subplots(1,2,figsize=(18,10))
# Violin plot: Pclass vs Age
sns.violinplot("Pclass","Age",
hue="Survived",
data=data,
split=True,
ax=ax[0])
ax[0].set_title('Survived Based on Pclass and Age')
ax[0].set_yticks(range(0,110,10))
# Violin plot: Sex vs Age
sns.violinplot("Sex",
"Age",
hue="Survived",
data=data,
split=True,ax=ax[1])
ax[1].set_title('Survived Based on Sex and Age')
ax[1].set_yticks(range(0,110,10))
plt.show()
Feature: Name
In 22:
data['Start']=0
# Extract the title (the word before the dot in each name), e.g. Mr, Miss, Lady
data['Start']=data["Name"].str.extract('([A-Za-z]+)\.', expand=False)
In 23:
data["Start"].value_counts()
Out23:
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Mlle 2
Major 2
Col 2
Countess 1
Capt 1
Ms 1
Sir 1
Lady 1
Mme 1
Don 1
Jonkheer 1
Name: Start, dtype: int64
In 24:
pd.crosstab(data.Start,data.Sex)
Out24:
| Start | female | male |
|---|---|---|
| Capt | 0 | 1 |
| Col | 0 | 2 |
| Countess | 1 | 0 |
| Don | 0 | 1 |
| Dr | 1 | 6 |
| Jonkheer | 0 | 1 |
| Lady | 1 | 0 |
| Major | 0 | 2 |
| Master | 0 | 40 |
| Miss | 182 | 0 |
| Mlle | 2 | 0 |
| Mme | 1 | 0 |
| Mr | 0 | 517 |
| Mrs | 125 | 0 |
| Ms | 1 | 0 |
| Rev | 0 | 6 |
| Sir | 0 | 1 |
In 25:
pd.crosstab(data.Start,data.Sex).T  # transpose
Out25:
In 26:
# Styled pivot table of the counts
pd.crosstab(data.Start,data.Sex).T.style.background_gradient(cmap='summer_r')
Group the extracted titles into five classes: Master, Miss, Mr, Mrs, Other.
In 27:
data['Start'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer',
                       'Col','Rev','Capt','Sir','Don'],         # original values
                      ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs',
                       'Other','Other','Other','Mr','Mr','Mr'],  # replacements
                      inplace=True)
Mean age per title group
In 28:
data.groupby('Start')['Age'].mean()
Out28:
Start
Master 4.574167
Miss 21.860000
Mr 32.739609
Mrs 35.981818
Other 45.888889
Name: Age, dtype: float64
Fill the missing ages in each group with (approximately) that group's mean:
In 29:
# Reusable pattern: fill Age only where both conditions hold
data.loc[(data.Age.isnull())&(data.Start=='Master'),'Age']=5  # fill missing Age for this title group
data.loc[(data.Age.isnull())&(data.Start=='Miss'),'Age']=22
data.loc[(data.Age.isnull())&(data.Start=='Mr'),'Age']=33
data.loc[(data.Age.isnull())&(data.Start=='Mrs'),'Age']=36
data.loc[(data.Age.isnull())&(data.Start=='Other'),'Age']=46
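The five loc assignments can also be written as one reusable line (a sketch; it fills with each group's exact mean rather than the rounded values above):

data['Age'] = data['Age'].fillna(data.groupby('Start')['Age'].transform('mean'))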
In 30:
data.Age.isnull().any()  # no missing values left
Out30:
False
Survival counts across age:
In 31:
f,ax=plt.subplots(1,2,figsize=(20,10))
x=list(range(0,85,5))
# Plot 1: histogram
data[data['Survived']==0].Age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('Survived = 0')  # title
ax[0].set_xticks(x)  # x-axis ticks
data[data['Survived']==1].Age.plot.hist(ax=ax[1],bins=20,edgecolor='black',color='blue')
ax[1].set_title('Survived = 1')
ax[1].set_xticks(x)
plt.show()
In 32:
sns.factorplot('Pclass','Survived',col='Start',data=data)
plt.show()
Feature: Embarked
In 33:
(pd.crosstab([data.Embarked,data.Pclass],
[data.Sex,data.Survived],margins=True)
.style
.background_gradient(cmap='summer_r'))
In 34:
sns.factorplot('Embarked','Survived',data=data)
fig=plt.gcf()
fig.set_size_inches(8,5)
plt.show()
In 35:
f,ax=plt.subplots(2,2,figsize=(20,15))
# Plot 1: passenger counts per Embarked port
sns.countplot('Embarked',data=data,ax=ax[0,0])
ax[0,0].set_title('No. Of Passengers Boarded')
# Plot 2: counts per Embarked port, split by Sex
sns.countplot('Embarked',hue='Sex',data=data,ax=ax[0,1])
ax[0,1].set_title('Male-Female Split for Embarked')
# Plot 3: counts per Embarked port, split by Survived
sns.countplot('Embarked',hue='Survived',data=data,ax=ax[1,0])
ax[1,0].set_title('Embarked vs Survived')
# Plot 4: counts per Embarked port, split by Pclass
sns.countplot('Embarked',hue='Pclass',data=data,ax=ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')
# adjust subplot spacing
plt.subplots_adjust(wspace=0.25,hspace=0.5)
plt.show()
In 36:
sns.factorplot('Pclass',
'Survived',
hue='Sex',
col='Embarked',
data=data
)
plt.show()
Fill the missing Embarked values with the mode, 'S':
In 37:
data['Embarked'].fillna('S',inplace=True)
In 38:
data.Embarked.isnull().any()
Out38:
False
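The same fill without hard-coding 'S' (a sketch; mode()[0] picks the most frequent value):

data['Embarked'].fillna(data['Embarked'].mode()[0], inplace=True)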
Feature: SibSp
In 39:
pd.crosstab([data.SibSp],data.Survived).style.background_gradient(cmap='summer_r')
In 40:
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot('SibSp', 'Survived', data=data, ax=ax[0])
ax[0].set_title('Survived Based on SibSp')
sns.factorplot('SibSp', 'Survived', data=data, ax=ax[1])
ax[1].set_title('Survived Based on SibSp')
# plt.close(2)
plt.show()
In 41:
pd.crosstab(data.SibSp,data.Pclass).style.background_gradient(cmap='YlOrBr_r')
Feature: Parch
In 42:
pd.crosstab(data.Parch,data.Pclass).style.background_gradient(cmap='summer_r')
In 43:
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot('Parch','Survived',data=data,ax=ax[0])
ax[0].set_title('Parch vs Survived')
sns.factorplot('Parch','Survived',data=data,ax=ax[1])
ax[1].set_title('Parch vs Survived')
plt.show()
Feature: Fare
In 44:
print('Highest fare: ',data['Fare'].max())
print('Lowest fare: ',data['Fare'].min())
print('Average fare: ',data['Fare'].mean())
Highest fare: 512.3292
Lowest fare: 0.0
Average fare: 32.2042079685746
In 45:
f,ax=plt.subplots(1,3,figsize=(20,8))
# Density histograms (distplot) of fares, one per passenger class
sns.distplot(data[data['Pclass']==1].Fare,ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.distplot(data[data['Pclass']==2].Fare,ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.distplot(data[data['Pclass']==3].Fare,ax=ax[2])
ax[2].set_title('Fares in Pclass 3')
plt.show()
Feature correlations
Plot a heatmap of the feature correlation matrix:
In 46:
sns.heatmap(data.corr(),  # correlation matrix
annot=True,
cmap='RdBu_r',
linewidths=0.2
)
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()
Feature engineering and derived features
Age binning: Age_band
In 47:
data['Age_band']=0  # initialize
# Split Age into five bands
data.loc[data['Age']<=16,'Age_band']=0
data.loc[(data['Age']>16)&(data['Age']<=32),'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
data.head()
Out47:
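The five bands can also be produced in a single call with pd.cut (a sketch; the bin edges mirror the loc-based version above):

data['Age_band'] = pd.cut(data['Age'], bins=[0, 16, 32, 48, 64, 81], labels=[0, 1, 2, 3, 4]).astype(int)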
In 48:
sns.countplot(data['Age_band'])
plt.show()
Survival across Age_band, per Pclass:
In 49:
sns.factorplot('Age_band','Survived',data=data,col='Pclass')
plt.show()
Family size (Family_Size) and traveling alone (Alone)
In 50:
data['Family_Size']=0  # initialize
data['Family_Size']=data['Parch']+data['SibSp']  # total relatives aboard
data['Alone']=0  # initialize
data.loc[data.Family_Size==0,'Alone']=1  # alone: traveling by oneself
In 51:
# f,ax=plt.subplots(1,2,figsize=(18,6))  # not needed: factorplot manages its own figure
In 52:
sns.factorplot('Family_Size','Survived',data=data)
plt.title('Family_Size vs Survived')
plt.show()
In 53:
sns.factorplot('Alone','Survived',data=data)
plt.title('Alone vs Survived')
plt.show()
In 54:
# Effect of traveling alone on survival, by sex and class
sns.factorplot('Alone',
'Survived',
data=data,
hue='Sex',
col='Pclass')
plt.show()
Fare binning
In 55:
data['Fare_Range']=pd.qcut(data['Fare'],4)  # four quantile bins
In 56:
data["Fare_Range"].value_counts()
Out56:
(7.91, 14.454] 224
(-0.001, 7.91] 223
(14.454, 31.0] 222
(31.0, 512.329] 222
Name: Fare_Range, dtype: int64
In 57:
sns.countplot(data['Fare_Range'])
plt.show()
The four fare bins hold almost equal numbers of passengers.
In 58:
# Fare categories
data['Fare_cat']=0
data.loc[data['Fare']<=7.91,'Fare_cat']=0
data.loc[(data['Fare']>7.91)&(data['Fare']<=14.454),'Fare_cat']=1
data.loc[(data['Fare']>14.454)&(data['Fare']<=31),'Fare_cat']=2
data.loc[(data['Fare']>31)&(data['Fare']<=513),'Fare_cat']=3
data.head()
Out58:
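The manual binning above simply numbers the qcut bins, so it collapses to one line (a sketch):

data['Fare_cat'] = pd.qcut(data['Fare'], 4, labels=False)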
In 59:
# Survival by fare category and sex
sns.factorplot('Fare_cat','Survived',data=data,hue='Sex')
plt.show()
Converting string features to numeric
In 60:
# Direct replacement
data['Sex'].replace(['male','female'],[0,1],inplace=True)
data['Embarked'].replace(['S','C','Q'],[0,1,2],inplace=True)
data['Start'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4],inplace=True)
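An equivalent with map(), which has the nice property that any unexpected string becomes NaN instead of slipping through silently (a sketch):

data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
data['Embarked'] = data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
data['Start'] = data['Start'].map({'Mr': 0, 'Mrs': 1, 'Miss': 2, 'Master': 3, 'Other': 4})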
Dropping unused features
Drop features that are useless or redundant for modeling:
- Name
- Age: replaced by Age_band
- Ticket
- Fare: replaced by Fare_cat
- Cabin
- Fare_Range: replaced by Fare_cat
- PassengerId
In 61:
data.drop(['Name','Age','Ticket','Fare','Cabin','Fare_Range','PassengerId'],axis=1,inplace=True)
Feature correlations (after engineering)
In 62:
sns.heatmap(data.corr(),  # correlation matrix
annot=True,
cmap='RdBu_r',
linewidths=0.2
)
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()
Modeling
Basic information of the data used for modeling:
In 64:
data.shape
Out64:
(891, 11)
In 65:
data.isnull().sum()
Out65:
Survived 0
Pclass 0
Sex 0
SibSp 0
Parch 0
Embarked 0
Start 0
Age_band 0
Family_Size 0
Alone 0
Fare_cat 0
dtype: int64
In 66:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Sex 891 non-null int64
3 SibSp 891 non-null int64
4 Parch 891 non-null int64
5 Embarked 891 non-null int64
6 Start 891 non-null int64
7 Age_band 891 non-null int64
8 Family_Size 891 non-null int64
9 Alone 891 non-null int64
10 Fare_cat 891 non-null int64
dtypes: int64(11)
memory usage: 76.7 KB
Import the modeling libraries
In 67:
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
Splitting the data
In 68:
train,test=train_test_split(data,
                            test_size=0.3,
                            random_state=0,
                            stratify=data['Survived']  # keep the class ratio of the full data in both splits
)
train_X=train[train.columns[1:]]
train_Y=train[train.columns[:1]]
test_X=test[test.columns[1:]]
test_Y=test[test.columns[:1]]
X=data[data.columns[1:]]
Y=data['Survived']
Radial Support Vector Machines (rbf-SVM)
rbf-SVM: an SVM with the radial basis function (RBF) kernel.
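The RBF kernel scores the similarity of two samples as K(x, x') = exp(-gamma * ||x - x'||^2), so gamma controls how quickly similarity decays with distance, while C trades margin width against training errors.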
In 69:
model=svm.SVC(kernel='rbf',C=1,gamma=0.1)
model.fit(train_X,train_Y)
prediction1=model.predict(test_X)
In 70:
metrics.accuracy_score(prediction1,test_Y)  # accuracy
Out70:
0.835820895522388
Linear Support Vector Machine (linear-SVM)
In 71:
model=svm.SVC(kernel='linear',C=0.1,gamma=0.1)
model.fit(train_X,train_Y)
prediction2=model.predict(test_X)
In 72:
metrics.accuracy_score(prediction2,test_Y)  # accuracy
Out72:
0.8171641791044776
Logistic Regression
In 73:
model = LogisticRegression()
model.fit(train_X,train_Y)
prediction3=model.predict(test_X)
In 74:
metrics.accuracy_score(prediction3,test_Y)
Out74:
0.8134328358208955
K-Nearest Neighbours(KNN)
In 75:
model=KNeighborsClassifier()
model.fit(train_X,train_Y)
prediction4=model.predict(test_X)
In 76:
metrics.accuracy_score(prediction4,test_Y)
Out76:
0.8134328358208955
Check the accuracy for different numbers of neighbors k:
In 77:
a_index = list(range(1,11))
scores = []
for i in a_index:
    model=KNeighborsClassifier(n_neighbors=i)  # i neighbors
    model.fit(train_X,train_Y)
    prediction=model.predict(test_X)
    # record the accuracy for each number of neighbors to see which k is best
    scores.append(metrics.accuracy_score(prediction,test_Y))
a = pd.Series(scores, index=a_index)
plt.plot(a_index, a)  # plot accuracy vs k
plt.xticks(a_index)
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()
The individual accuracy scores:
In 78:
a.values
Out78:
array([0.76119403, 0.76865672, 0.79477612, 0.80597015, 0.81343284,
0.81343284, 0.82462687, 0.82089552, 0.8358209 , 0.84328358])
In 79:
a.values.max()
Out79:
0.8432835820895522
Gaussian Naive Bayes
In 80:
model=GaussianNB()
model.fit(train_X,train_Y)
prediction5=model.predict(test_X)
In 81:
metrics.accuracy_score(prediction5,test_Y)
Out81:
0.8134328358208955
Decision Tree
In 82:
model=DecisionTreeClassifier()
model.fit(train_X,train_Y)
prediction6=model.predict(test_X)
In 83:
metrics.accuracy_score(prediction6,test_Y)
Out83:
0.8022388059701493
Random Forests
In 84:
model=RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_Y)
prediction7=model.predict(test_X)
In 85:
metrics.accuracy_score(prediction7,test_Y)
Out85:
0.8208955223880597
Cross-validation
Run 10-fold cross-validation:
In 86:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
kfold = KFold(n_splits=10, random_state=22, shuffle=True)
In 87:
# record the CV mean, per-fold accuracy, and standard deviation for each model
cv_mean=[]
accuracy=[]
std=[]
# the classification models to compare
classifiers=['Linear Svm',
'Radial Svm',
'Logistic Regression',
'KNN',
'Decision Tree',
'Naive Bayes',
'Random Forest']
models=[svm.SVC(kernel='linear'),
svm.SVC(kernel='rbf'),
LogisticRegression(),
KNeighborsClassifier(n_neighbors=9),
DecisionTreeClassifier(),
GaussianNB(),
RandomForestClassifier(n_estimators=100)
]
# loop over the models, collecting the mean, std, and per-fold accuracies
for model in models:
    cv_result = cross_val_score(model,X,Y, cv = kfold,scoring = "accuracy")
    # the three statistics
    cv_mean.append(cv_result.mean())
    std.append(cv_result.std())
    accuracy.append(cv_result)
new_models_df=pd.DataFrame({'CV Mean':cv_mean, 'Std':std},index=classifiers)
new_models_df
Out87:
| | CV Mean | Std |
|---|---|---|
| Linear Svm | 0.784607 | 0.057841 |
| Radial Svm | 0.828377 | 0.057096 |
| Logistic Regression | 0.799176 | 0.040154 |
| KNN | 0.812634 | 0.041063 |
| Decision Tree | 0.809226 | 0.044548 |
| Naive Bayes | 0.795843 | 0.054861 |
| Random Forest | 0.811486 | 0.041164 |
Visualizing the results
In 88:
plt.subplots(figsize=(12,6))
box=pd.DataFrame(accuracy,           # per-fold accuracies
                 index=[classifiers] # model names
)
box.T.boxplot()
plt.show()
In 89:
new_models_df['CV Mean'].plot.barh(width=0.7)
plt.title('Average CV Mean Accuracy')
fig=plt.gcf()
fig.set_size_inches(8,6)
plt.show()
Confusion matrices
Confusion matrices of the cross-validated predictions show how each classifier performs:
In 90:
f,ax=plt.subplots(3,3,figsize=(12,10))
y_pred = cross_val_predict(svm.SVC(kernel='rbf'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,0],annot=True,fmt='2.0f')
ax[0,0].set_title('Matrix for rbf-SVM')
y_pred = cross_val_predict(svm.SVC(kernel='linear'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,1],annot=True,fmt='2.0f')
ax[0,1].set_title('Matrix for Linear-SVM')
y_pred = cross_val_predict(LogisticRegression(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,2],annot=True,fmt='2.0f')
ax[0,2].set_title('Matrix for Logistic Regression')
y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=9),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,0],annot=True,fmt='2.0f')
ax[1,0].set_title('Matrix for KNN')
y_pred = cross_val_predict(GaussianNB(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,1],annot=True,fmt='2.0f')
ax[1,1].set_title('Matrix for Naive Bayes')
y_pred = cross_val_predict(RandomForestClassifier(n_estimators=100),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,2],annot=True,fmt='2.0f')
ax[1,2].set_title('Matrix for Random-Forests')
y_pred = cross_val_predict(DecisionTreeClassifier(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[2,0],annot=True,fmt='2.0f')
ax[2,0].set_title('Matrix for Decision Tree')
plt.subplots_adjust(hspace=0.2,wspace=0.2)
plt.show()
Hyperparameter tuning
In 91:
from sklearn.model_selection import GridSearchCV
SVM
In 92:
# parameter values to search over
C=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
gamma=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
kernel=['rbf','linear']
# the parameter grid as a dict
hyper={'kernel':kernel,'C':C,'gamma':gamma}
# grid search
gd=GridSearchCV(estimator=svm.SVC(), param_grid=hyper, verbose=True)
gd.fit(X,Y)
Fitting 5 folds for each of 240 candidates, totalling 1200 fits
Out92:
GridSearchCV(estimator=SVC(),
param_grid={'C': [0.05, 0.1, 0.2, 0.3, 0.25, 0.4, 0.5, 0.6, 0.7,
0.8, 0.9, 1],
'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
1.0],
'kernel': ['rbf', 'linear']},
verbose=True)
Check the best score and parameter combination:
In 93:
print(gd.best_score_)      # best score
print(gd.best_estimator_)  # best parameter combination
0.8282593685267716
SVC(C=0.4, gamma=0.3)
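Because GridSearchCV refits the best estimator on the full data by default (refit=True), gd can be used directly for prediction; a quick sketch:

best_svm = gd.best_estimator_   # SVC(C=0.4, gamma=0.3)
best_svm.predict(test_X)[:10]   # spot-check on the held-out split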
Random Forests
In 94:
n_estimators=range(100,1000,100)
hyper={'n_estimators': n_estimators}
gd=GridSearchCV(estimator=RandomForestClassifier(random_state=0),
param_grid=hyper,
verbose=True
)
gd.fit(X,Y)
Fitting 5 folds for each of 9 candidates, totalling 45 fits
Out94:
GridSearchCV(estimator=RandomForestClassifier(random_state=0),
param_grid={'n_estimators': range(100, 1000, 100)}, verbose=True)
In 95:
print(gd.best_score_)
print(gd.best_estimator_)
0.819327098110602
RandomForestClassifier(n_estimators=300, random_state=0)
Ensemble learning (ensembling)
- Voting Classifier
- Bagging
- Boosting
Voting Classifier
In 96:
from sklearn.ensemble import VotingClassifier
In 97:
ensemble_model = VotingClassifier(estimators=[
('KNN', KNeighborsClassifier(n_neighbors=10)),
('SVM-R', svm.SVC(probability=True,kernel='rbf',C=0.5,gamma=0.1)),
('RF', RandomForestClassifier(n_estimators=500,random_state=0)),
('LR', LogisticRegression(C=0.05)),
('DT', DecisionTreeClassifier(random_state=0)),
('NB', GaussianNB()),
('SVM-L', svm.SVC(kernel='linear',probability=True))],
voting='soft').fit(train_X,train_Y)
In 98:
ensemble_model.score(test_X,test_Y)
Out98:
0.8246268656716418
In 99:
# Cross-validation of the ensemble on the full data (X, Y)
cross=cross_val_score(ensemble_model,X,Y, cv = 10, scoring = "accuracy")
cross.mean()
Out99:
0.8226716604244695
Bagging
Bagged KNN
In 100:
from sklearn.ensemble import BaggingClassifier
In 101:
model_knn=BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=3),
random_state=0,
n_estimators=700
)
model_knn.fit(train_X,train_Y)
prediction=model_knn.predict(test_X)
In 102:
metrics.accuracy_score(prediction,test_Y)  # accuracy
Out102:
0.832089552238806
In 103:
# cross-validation
result = cross_val_score(model_knn, X, Y, cv=10, scoring='accuracy')
result.mean()
Out103:
0.8137952559300874
Bagged DecisionTree
In 104:
model_dt = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
random_state=0,
n_estimators=100
)
model_dt.fit(train_X,train_Y)
prediction=model_dt.predict(test_X)
In 105:
metrics.accuracy_score(prediction,test_Y)  # accuracy
Out105:
0.8208955223880597
In 106:
# cross-validation
result = cross_val_score(model_dt, X, Y, cv=10, scoring='accuracy')
result.mean()
Out106:
0.8171410736579275
Boosting
AdaBoost(Adaptive Boosting)
In 107:
from sklearn.ensemble import AdaBoostClassifier
In 108:
ada = AdaBoostClassifier(n_estimators=200, random_state=0, learning_rate=0.1)
result=cross_val_score(ada, X, Y, cv=10,scoring='accuracy')
result.mean()
Out108:
0.8249188514357055
Stochastic Gradient Boosting
In 109:
from sklearn.ensemble import GradientBoostingClassifier
In 110:
grad = GradientBoostingClassifier(n_estimators=500, random_state=0, learning_rate=0.1)
result=cross_val_score(grad,X,Y,cv=10,scoring='accuracy')
result.mean()
Out110:
0.8115230961298376
XGBoost
In 111:
import xgboost as xg
xgboost=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
result=cross_val_score(xgboost,X,Y,cv=10,scoring='accuracy')
result.mean()
Out111:
0.8160299625468165
Comparing the three boosting models, AdaBoost scores highest, so we tune its hyperparameters next.
Hyperparameter tuning for AdaBoost
In 112:
n_estimators = list(range(100,1100,100))
learn_rate = [0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
hyper = {'n_estimators': n_estimators,
'learning_rate': learn_rate}
gd = GridSearchCV(estimator=AdaBoostClassifier(), param_grid=hyper, verbose=True)
gd.fit(X,Y)
Fitting 5 folds for each of 110 candidates, totalling 550 fits
Out112:
GridSearchCV(estimator=AdaBoostClassifier(),
param_grid={'learning_rate': [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6,
0.7, 0.8, 0.9, 1],
'n_estimators': [100, 200, 300, 400, 500, 600, 700,
800, 900, 1000]},
verbose=True)
In 113:
# best score and best parameter combination
print(gd.best_score_)
print(gd.best_estimator_)
0.8293892411022534
AdaBoostClassifier(learning_rate=0.1, n_estimators=100)
Confusion matrix for the tuned AdaBoost model
In 114:
ada = AdaBoostClassifier(n_estimators=100,random_state=0,learning_rate=0.1)
result = cross_val_predict(ada,X,Y,cv=10)
sns.heatmap(confusion_matrix(Y, result),  # confusion-matrix counts
cmap='winter',
annot=True,
fmt='2.0f'
)
plt.show()
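To complement the confusion matrix with per-class precision and recall, a short sketch:

from sklearn.metrics import classification_report
print(classification_report(Y, result))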
Feature importance (four tree-based models)
Feature importance scores from each of the four tree-based models:
In 115:
f,ax=plt.subplots(2,2,figsize=(15,12))
# 1. the model
rf=RandomForestClassifier(n_estimators=500,random_state=0)
# 2. fit
rf.fit(X,Y)
# 3. sort the importances and plot
pd.Series(rf.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,0])
# 4. add a title
ax[0,0].set_title('Feature Importance in Random Forests')
ada=AdaBoostClassifier(n_estimators=200,learning_rate=0.05,random_state=0)
ada.fit(X,Y)
pd.Series(ada.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,1],color='#9dff11')
ax[0,1].set_title('Feature Importance in AdaBoost')
gbc=GradientBoostingClassifier(n_estimators=500,learning_rate=0.1,random_state=0)
gbc.fit(X,Y)
pd.Series(gbc.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,0],cmap='RdYlGn_r')
ax[1,0].set_title('Feature Importance in Gradient Boosting')
xgbc=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
xgbc.fit(X,Y)
pd.Series(xgbc.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,1],color='#FD0F00')
ax[1,1].set_title('Feature Importance in XgBoost')
plt.show()
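To read exact importance values rather than eyeballing the bars, e.g. for the random forest above (a sketch):

pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)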