Kaggle in Practice: Credit Card Customer Churn Early Warning
This post walks through the data analysis and modeling for a Kaggle customer churn prediction task.
Background
In recent years, both traditional industries and internet companies have faced the problem of customer churn. Banks, telephone service providers, internet companies, insurers, and similar businesses routinely treat churn analysis and churn rate as one of their key business metrics.
In general, the cost of retaining an existing customer is far lower than the cost of acquiring a new one. These companies therefore run customer service teams dedicated to winning back customers who are about to leave, because existing customers are more valuable to the company than new ones.
Keep one thing in mind: acquiring customers is expensive, so retention matters.
Import Libraries
In [1]:
import numpy as np
import pandas as pd
import plotly as py
import plotly_express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score as f1
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
In [2]:
df = pd.read_csv("BankChurners.csv")
df.head()
Basic Data Information
df.shape
# Result
(10127, 23)
The result shows a total of 10127 rows and 23 columns.
In [3]:
# all columns
columns = df.columns
columns
Out[3]:
Index(['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
'Dependent_count', 'Education_Level', 'Marital_Status',
'Income_Category', 'Card_Category', 'Months_on_book',
'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'],
dtype='object')
The columns are described as follows:
- CLIENTNUM: Client number. Unique identifier for the customer holding the account
- Attrition_Flag: Flag indicating account closure in the next 6 months (between Jan and Jun 2013)
- Customer_Age: Age of the account holder
- Gender: Gender of the account holder
- Dependent_count: Number of people financially dependent on the account holder
- Education_Level: Educational qualification of the account holder (e.g. high school, college graduate)
- Marital_Status: Marital status of the account holder (Single, Married, Divorced, Unknown)
- Income_Category: Annual income category of the account holder
- Card_Category: Card type, depicting the card variants by value proposition (Blue, Silver, Gold, Platinum)
- Months_on_book: Number of months since the account holder opened an account with the lender
- Total_Relationship_Count: Total number of products held by the customer, i.e. the number of relationships the account holder has with the bank (e.g. retail banking, mortgage, wealth management)
- Months_Inactive_12_mon: Total number of months inactive in the last 12 months
- Contacts_Count_12_mon: Number of contacts in the last 12 months, i.e. how many times the account holder called the call center in the past 12 months
- Credit_Limit: Credit limit
- Total_Revolving_Bal: Total revolving balance
- Avg_Open_To_Buy: Open-to-buy credit line (average of the last 12 months)
- Total_Amt_Chng_Q4_Q1: Change in transaction amount (Q4 over Q1)
- Total_Trans_Amt: Total transaction amount (last 12 months)
- Total_Trans_Ct: Total transaction count (last 12 months)
- Total_Ct_Chng_Q4_Q1: Change in transaction count (Q4 over Q1)
- Avg_Utilization_Ratio: Average card utilization ratio
- Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1
- Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
(The last two are pre-computed Naive Bayes classifier outputs shipped with the dataset; they are dropped below.)
In [4]:
df.dtypes # column dtypes (partial screenshot in the original post)
The following code counts how many columns there are of each dtype:
# count columns by dtype
pd.value_counts(df.dtypes)
int64 10
float64 7
object 6
dtype: int64
df.describe().style.background_gradient(cmap="ocean_r") # styled table output
Styled descriptive statistics of df (partial columns shown)
Missing Values
In [7]:
# missing-value count per column
df.isnull().sum()
# missing-value ratio: the data has no missing values
total = df.isnull().sum().sort_values(ascending=False)
Percentage = total / len(df)
With the counts sorted in descending order, the first value is 0, which shows the data contains no missing values.
Dropping Irrelevant Columns
In [9]:
no_use = np.arange(21, df.shape[1]) # indices of the last two columns
no_use
Out[9]:
array([21, 22])
In [10]:
# 1. drop multiple columns
df.drop(df.columns[no_use], axis=1, inplace=True)
In [11]:
CLIENTNUM is the customer ID; it is useless for modeling, so we drop it directly:
# 2. drop a single column
df.drop("CLIENTNUM", axis=1, inplace=True)
The columns of the new df (after dropping the invalid ones):
In [12]:
df.columns
Out[12]:
Index(['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count',
'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
dtype='object')
In [13]:
View the descriptive statistics of the data again:
df.describe().style.background_gradient(cmap="ocean_r")
EDA: Exploratory Data Analysis
Usage Frequency and Numeric Features
In [14]:
Select the customer-related numeric columns:
# df_frequency = df[["Customer_Age","Total_Trans_Ct","Total_Trans_Amt","Months_Inactive_12_mon","Credit_Limit","Attrition_Flag"]] # equivalent to the code below
df_frequency = pd.concat([df['Customer_Age'],
df['Total_Trans_Ct'],
df['Total_Trans_Amt'],
df['Months_Inactive_12_mon'],
df['Credit_Limit'],
df['Attrition_Flag']],
axis=1)
df_frequency.head()
Explore the pairwise relationships between columns for the two Attrition_Flag groups:
In [15]:
df["Attrition_Flag"].value_counts()
Out[15]:
Existing Customer 8500 # existing customers
Attrited Customer 1627 # attrited customers
Name: Attrition_Flag, dtype: int64
The result shows 8500 existing customers and 1627 attrited customers.
In [16]:
# set up the figure
fig, ax = plt.subplots(ncols=4, figsize=(20,6))
sns.scatterplot(data=df_frequency,
x="Total_Trans_Amt",
y="Total_Trans_Ct",
hue="Attrition_Flag",
ax=ax[0])
sns.scatterplot(data=df_frequency,
x="Months_Inactive_12_mon",
y="Total_Trans_Ct",
hue="Attrition_Flag",
ax=ax[1])
sns.scatterplot(data=df_frequency,
x="Credit_Limit",
y="Total_Trans_Ct",
hue="Attrition_Flag",
ax=ax[2])
sns.scatterplot(data=df_frequency,
x="Customer_Age",
y="Total_Trans_Ct",
hue="Attrition_Flag",
ax=ax[3])
plt.show()
The same plots implemented with plotly:
for col in ["Customer_Age", "Total_Trans_Amt", "Months_Inactive_12_mon", "Credit_Limit"]:
    fig = px.scatter(df_frequency,
                     x=col,
                     y="Total_Trans_Ct",
                     color="Attrition_Flag")
    fig.show()
# make a copy
df_frequency_copy = df_frequency.copy()
df_frequency_copy["Attrition_Flag_number"] = df_frequency_copy["Attrition_Flag"].apply(lambda x: 1 if x == "Existing Customer" else 2)
# two basic parameters: the number of rows and columns
four_columns = ["Total_Trans_Amt", "Months_Inactive_12_mon", "Credit_Limit", "Customer_Age"]
fig = make_subplots(rows=1,
cols=4,
start_cell="top-left",
shared_yaxes=True,
subplot_titles=four_columns # subplot titles
)
for i, v in enumerate(four_columns):
    r = i // 4 + 1   # row index (1-based)
    c = (i + 1) % 4  # column index (the remainder; 0 maps to the last column)
    if c == 0:
        fig.add_trace(go.Scatter(x=df_frequency_copy[v].tolist(),
                                 y=df_frequency_copy["Total_Trans_Ct"].tolist(),
                                 mode='markers',
                                 marker=dict(color=df_frequency_copy.Attrition_Flag_number)),
                      row=r, col=4)
    else:
        fig.add_trace(go.Scatter(x=df_frequency_copy[v].tolist(),
                                 y=df_frequency_copy["Total_Trans_Ct"].tolist(),
                                 mode='markers',
                                 marker=dict(color=df_frequency_copy.Attrition_Flag_number)),
                      row=r, col=c)
fig.update_layout(width=1000, height=450, showlegend=False)
fig.show()
Blue: existing customers; yellow: attrited customers.
A few conclusions:
- Panel 1: the more a customer spends per year, the more likely they are to stay (not churn)
- Panel 2: customers who are inactive for 2-3 months are considerably more likely to churn
- Panel 3: the higher the credit limit, the more likely the customer is to stay
- Also visible across the panels: most attrited customers have a total transaction count below 100
- Panel 4: the age distribution suggests age is not an important factor
Customer Demographics
The demographic information mainly covers age, gender, education level, marital status (single, married, etc.), and income level.
In [21]:
Select the relevant columns for analysis:
df_demographic=df[['Customer_Age',
'Gender',
'Education_Level',
'Marital_Status',
'Income_Category',
'Attrition_Flag']]
df_demographic.head()
Age Distribution by Customer Type
In [22]:
px.violin(df_demographic,
y="Customer_Age",
color="Attrition_Flag")
The violin plot shows that the two customer types have similar age distributions.
Conclusion: age is not a key factor in whether a customer churns.
Age Distribution
Look at the overall age distribution of customers in the data:
fig = make_subplots(rows=2, cols=1)
trace1=go.Box(x=df['Customer_Age'],name='Age With Box Plot',boxmean=True)
trace2=go.Histogram(x=df['Customer_Age'],name='Age With Histogram')
fig.add_trace(trace1, row=1,col=1)
fig.add_trace(trace2, row=2,col=1)
fig.update_layout(height=500, width=1000, title_text="Customer Age Distribution")
fig.show()
Age is roughly normally distributed, with most customers between 40 and 55.
Customer Counts by Type and Gender
In [23]:
flag_gender = df.groupby(["Attrition_Flag","Gender"]).size().reset_index().rename(columns={0:"number"})
flag_gender
Out[23]:
   Attrition_Flag     Gender  number
0  Attrited Customer  F       930
1  Attrited Customer  M       697
2  Existing Customer  F       4428
3  Existing Customer  M       4072
In [24]:
fig = px.bar(flag_gender,
x="Attrition_Flag",
y="number",
color="Gender",
barmode="group",
text="number")
fig.show()
The bar chart shows:
- There are more women than men in this dataset, and women outnumber men within both customer types
- The data is imbalanced: existing vs. attrited customers is roughly 8500 to 1627
Crosstab Analysis
Statistical analysis based on pandas crosstabs. A good article explaining crosstab: https://pbpython.com/pandas-crosstab.html
In [25]:
fig, (ax1,ax2,ax3,ax4) = plt.subplots(ncols=4, figsize=(20,5))
pd.crosstab(df["Attrition_Flag"],df["Gender"]).plot(kind="bar", ax=ax1, ylim=[0,5000])
pd.crosstab(df["Attrition_Flag"],df["Education_Level"]).plot(kind="bar", ax=ax2, ylim=[0,5000])
pd.crosstab(df["Attrition_Flag"],df["Marital_Status"]).plot(kind="bar", ax=ax3, ylim=[0,5000])
pd.crosstab(df["Attrition_Flag"],df["Income_Category"]).plot(kind="bar", ax=ax4, ylim=[0,5000])
fig, (ax1,ax2,ax3) = plt.subplots(ncols=3, figsize=(20,5))
pd.crosstab(df['Attrition_Flag'],df['Dependent_count']).plot(kind='bar',ax=ax1, ylim=[0,5000])
pd.crosstab(df['Attrition_Flag'],df['Card_Category']).plot(kind='bar',ax=ax2, ylim=[0,10000])
_box = sns.boxplot(data=df_demographic,x='Attrition_Flag',y='Customer_Age', ax=ax3)
plt.show()
Observation: the distributions of education level and marital status are similar across the two customer types, and the boxplot again confirms the earlier conclusion that age does not separate existing from attrited customers.
Education Level
fig = px.pie(df, names='Education_Level', title='Proportion Of Education Levels')
fig.show()
Comparing the Two Customer Counts
In [26]:
churn = df["Attrition_Flag"].value_counts()
churn
Out[26]:
Existing Customer 8500
Attrited Customer 1627
Name: Attrition_Flag, dtype: int64
In [27]:
churn.keys()
Out[27]:
Index(['Existing Customer', 'Attrited Customer'], dtype='object')
In [28]:
plt.pie(x=churn, labels=churn.keys(), autopct="%.1f%%")
plt.show()
The pie chart above shows:
- Existing customers make up the vast majority
- Later we will balance the two customer types by resampling
Correlation
The data contains both categorical and numeric columns, which call for different analysis and encoding approaches:
- Numeric variables: Pearson correlation coefficient
- Categorical variables: Cramér's V, a coefficient commonly used to measure the association between two categorical variables (see the formula below)
Reference: https://blog.csdn.net/deecheanW/article/details/120474864
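For reference, Cramér's V is computed from the chi-squared statistic of the two variables' contingency table; this standard definition is what the helper function further below implements:

$$ V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1,\ c-1)}} $$

where $\chi^2$ is the chi-squared statistic, $n$ is the total number of observations, and $r$, $c$ are the numbers of rows and columns of the contingency table. $V$ ranges from 0 (no association) to 1 (perfect association).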
# categorical columns
# equivalent: df.select_dtypes(include="O")
df_categorical = df.loc[:, df.dtypes == object]
df_categorical.head()
# numeric columns
df_number = df.select_dtypes(exclude="O")
df_number.head()
One-hot encode the Attrition_Flag column:
In [31]:
# one-hot encode Attrition_Flag and append the dummy columns to df_number,
# so the numeric correlation matrix below has "Existing Customer" / "Attrited Customer" columns
df_number = pd.concat([df_number, pd.get_dummies(df["Attrition_Flag"])], axis=1)
Label Encoding
In [34]:
from sklearn import preprocessing
label = preprocessing.LabelEncoder()
df_categorical_encoded = pd.DataFrame()
# label-encode each categorical column
for i in df_categorical.columns:
    df_categorical_encoded[i] = label.fit_transform(df_categorical[i])
Computing Cramér's V
In [35]:
from scipy.stats import chi2_contingency
# function to compute Cramér's V between two categorical variables
def cal_cramers_v(v1, v2):
    crosstab = np.array(pd.crosstab(v1, v2, rownames=None, colnames=None))
    stat = chi2_contingency(crosstab)[0]  # chi-squared statistic
    obs = np.sum(crosstab)                # total number of observations n
    mini = min(crosstab.shape) - 1        # min(r, c) - 1
    return np.sqrt(stat / (obs * mini))   # V = sqrt(chi2 / (n * min(r-1, c-1)))
In [36]:
rows = []
for v1 in df_categorical_encoded:
    col = []
    for v2 in df_categorical_encoded:
        # compute Cramér's V for the pair of columns
        cramers = cal_cramers_v(df_categorical_encoded[v1], df_categorical_encoded[v2])
        col.append(round(cramers, 2))
    rows.append(col)
In [37]:
# matrix of Cramér's V values
cramers_results = np.array(rows)
cramerv_matrix = pd.DataFrame(cramers_results,
columns=df_categorical_encoded.columns,
index=df_categorical_encoded.columns)
cramerv_matrix.head()
Plot the corresponding heatmap:
mask = np.triu(np.ones_like(cramerv_matrix, dtype=bool))
cat_heatmap = sns.heatmap(cramerv_matrix, # 系数矩阵
mask=mask,
vmin=-1,
vmax=1,
annot=True,
cmap="BrBG")
cat_heatmap.set_title("Heatmap of Correlation(Categorical)", fontdict={"fontsize": 14}, pad=12)
plt.show()
# Pearson correlation of the numeric columns
from scipy import stats
num_corr = df_number.corr() # correlation matrix
plt.figure(figsize = (16,6))
mask = np.triu(np.ones_like(num_corr, dtype=bool))
heatmap_number = sns.heatmap(num_corr, mask=mask,
vmin=-1, vmax=1,
annot=True, cmap="RdYlBu")
heatmap_number.set_title("Heatmap of Correlation(Number)", fontdict={"fontsize": 14}, pad=12)
plt.show()
fig, ax = plt.subplots(ncols=2, figsize=(15,6))
heatmap = sns.heatmap(num_corr[["Existing Customer"]].sort_values(by="Existing Customer", ascending=False),
ax=ax[0],
vmin=-1,
vmax=1,
annot=True,
cmap="coolwarm_r")
heatmap.set_title("Features Correlating with Existing Customers",fontdict={"fontsize":18}, pad=16);
heatmap = sns.heatmap(num_corr[["Attrited Customer"]].sort_values(by="Attrited Customer", ascending=False),
ax=ax[1],
vmin=-1,
vmax=1,
annot=True,
cmap="coolwarm_r")
heatmap.set_title("Features Correlating with Attrited Customers",fontdict={"fontsize":18}, pad=16);
fig.tight_layout(pad=5)
plt.show()
Summary: the right-hand heatmap shows that the following columns have essentially no correlation with the attrited-customer label; their coefficients lie between -0.1 and 0.1:
- Credit Limit
- Average Open To Buy
- Months On Book
- Age
- Dependent Count
We now drop these columns:
In [41]:
df_model = df.copy()
df_model = df_model.drop(['Credit_Limit','Customer_Age','Avg_Open_To_Buy','Months_on_book','Dependent_count'],axis=1)
Encoding the Target Label
In [42]:
df_model['Attrition_Flag'] = df_model['Attrition_Flag'].map({'Existing Customer': 1, 'Attrited Customer': 0})
One-hot encode the remaining categorical columns:
df_model = pd.get_dummies(df_model)
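As a quick illustration on a toy frame (hypothetical data, not the project dataset), pd.get_dummies leaves numeric columns untouched and expands each categorical column into one indicator column per category:
demo = pd.DataFrame({"Gender": ["F", "M", "F"], "Total_Trans_Ct": [42, 77, 31]})
pd.get_dummies(demo)
# Total_Trans_Ct is kept as-is; Gender becomes two indicator columns,
# Gender_F and Gender_M (0/1 or True/False depending on the pandas version)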
Modeling
Splitting the Data
We verified earlier that existing and attrited customers are imbalanced, so we use SMOTE (Synthetic Minority Oversampling Technique), which synthesizes new minority-class samples, to balance the training data.
In [50]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
In [51]:
# features and target
# X = df_model.drop("Attrition_Flag", axis=1, inplace=True)  # wrong: with inplace=True, drop returns None
X = df_model.loc[:, df_model.columns != "Attrition_Flag"]
y = df_model["Attrition_Flag"]
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
SMOTE Resampling
In [52]:
sm = SMOTE(sampling_strategy="minority", k_neighbors=20, random_state=42)
# run the resampling
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
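A quick sanity check (a minimal sketch; the exact counts depend on the split) confirms the resampled training set is balanced:
# before resampling the classes are imbalanced; after SMOTE
# both classes match the majority-class count
print(y_train.value_counts())
print(y_train_res.value_counts())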
Three Models
In [53]:
# 1. Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train_res, y_train_res)
Out[53]:
RandomForestClassifier()
Tree models generally do not require feature scaling, but support vector machines do:
In [54]:
# 2. Support Vector Machine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# SVM inputs need to be standardized
svm = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svm.fit(X_train_res, y_train_res)
Out[54]:
Pipeline(steps=[('standardscaler', StandardScaler()),
('svc', SVC(gamma='auto'))])
In [55]:
# 3. Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(n_estimators=100, # number of trees
learning_rate=1.0, # learning rate
max_depth=1, # maximum depth of each tree
random_state=42)
gb.fit(X_train_res, y_train_res)
Out[55]:
GradientBoostingClassifier(learning_rate=1.0, max_depth=1, random_state=42)
Model Predictions
In [56]:
y_rf = rf.predict(X_test)
y_svm = svm.predict(X_test)
y_gb = gb.predict(X_test)
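scikitplot was imported at the top but never used; as an optional illustration (an addition, not part of the original analysis), it can draw ROC curves directly from predicted class probabilities, for example for the random forest:
y_rf_proba = rf.predict_proba(X_test)  # class probabilities, shape (n_samples, 2)
skplt.metrics.plot_roc(y_test, y_rf_proba)
plt.show()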
Confusion Matrices
In [57]:
from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2; newer versions use ConfusionMatrixDisplay.from_estimator
fig,ax=plt.subplots(ncols=3, figsize=(20,6))
plot_confusion_matrix(rf, X_test, y_test, ax=ax[0])
ax[0].title.set_text('RF')
plot_confusion_matrix(svm, X_test, y_test, ax=ax[1])
ax[1].title.set_text('SVM')
plot_confusion_matrix(gb, X_test, y_test, ax=ax[2])
ax[2].title.set_text('GB')
fig.tight_layout(pad=5)
plt.show()
Classification Reports
In [58]:
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score
print('Random Forest Classifier')
print(classification_report(y_test, y_rf))
print('------------------------')
print('Support Vector Machine')
print(classification_report(y_test, y_svm))
print('------------------------')
print('Gradient Boosting')
print(classification_report(y_test, y_gb))
Random Forest Classifier
precision recall f1-score support
0 0.85 0.83 0.84 541
1 0.97 0.97 0.97 2801
accuracy 0.95 3342
macro avg 0.91 0.90 0.90 3342
weighted avg 0.95 0.95 0.95 3342
------------------------
Support Vector Machine
precision recall f1-score support
0 0.81 0.55 0.66 541
1 0.92 0.98 0.95 2801
accuracy 0.91 3342
macro avg 0.87 0.76 0.80 3342
weighted avg 0.90 0.91 0.90 3342
------------------------
Gradient Boosting
precision recall f1-score support
0 0.83 0.84 0.84 541
1 0.97 0.97 0.97 2801
accuracy 0.95 3342
macro avg 0.90 0.90 0.90 3342
weighted avg 0.95 0.95 0.95 3342
Judging from the confusion matrices and the classification metrics of the three models, the random forest and the gradient boosting model both outperform the support vector machine.
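To make the comparison concrete, the f1 alias of f1_score imported at the top can summarize each model on the minority class (a small supplementary check, consistent with the reports above):
# F1 on the attrited class (label 0) for each model
for name, pred in [("RF", y_rf), ("SVM", y_svm), ("GB", y_gb)]:
    print(name, round(f1(y_test, pred, pos_label=0), 3))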
Hyperparameter Tuning
We tune the random forest and gradient boosting models with two different search strategies:
- Random forest: randomized search
- Gradient boosting: grid search
Randomized Search: Random Forest
In [59]:
from sklearn.model_selection import RandomizedSearchCV
Define candidate values for each parameter to tune:
In [60]:
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# n_estimators: number of trees in the forest
max_features = ['auto', 'sqrt']  # 'auto' is deprecated in newer scikit-learn (for classifiers it equals 'sqrt')
In [61]:
# maximum depth of each tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
max_depth
Out[61]:
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None]
In [62]:
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
Random Search Parameters
In [64]:
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
Run the search; the results are as follows:
In [65]:
rf_random = RandomizedSearchCV(
estimator = rf, # the rf model
param_distributions=random_grid, # parameter distributions to sample from
n_iter=30,
cv=3,
verbose=2,
random_state=42,
n_jobs=-1)
rf_random.fit(X_train_res, y_train_res)
print(rf_random.best_params_)
Fitting 3 folds for each of 30 candidates, totalling 90 fits
# Result
{'n_estimators': 1400,
'min_samples_split': 2,
'min_samples_leaf': 1,
'max_features': 'auto',
'max_depth': 110,
'bootstrap': True}
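A side note: since RandomizedSearchCV refits the best model on the training data by default (refit=True), the tuned model is also available directly, which avoids retyping the parameters (rf_best is just a local name for illustration):
rf_best = rf_random.best_estimator_  # already fit on X_train_res, y_train_res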
Modeling with the Searched Parameters
Refit a model with the parameters found by the search:
In [67]:
rf_clf_search= RandomForestClassifier(n_estimators=1400,
min_samples_split=2,
min_samples_leaf=1,
max_features='auto',
max_depth=110,
bootstrap=True)
rf_clf_search.fit(X_train_res,y_train_res)
y_rf_opt=rf_clf_search.predict(X_test)
print('Random Forest Classifier (Optimized)')
print(classification_report(y_test, y_rf_opt))
_rf_opt=plot_confusion_matrix(rf_clf_search, X_test, y_test)
Random Forest Classifier (Optimized)
precision recall f1-score support
0 0.86 0.84 0.85 541
1 0.97 0.97 0.97 2801
accuracy 0.95 3342
macro avg 0.91 0.90 0.91 3342
weighted avg 0.95 0.95 0.95 3342
Confusion matrix after tuning: the top-left count rose from 449 to 452, meaning classification became slightly more accurate.
Grid Search: Gradient Boosting
Grid Search Parameters
In [68]:
from sklearn.model_selection import GridSearchCV
param_test1 = {'n_estimators':range(20,100,10)}
param_test1
Out[68]:
{'n_estimators': range(20, 100, 10)}
In [69]:
# run the search
grid_search1 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=1.0, # the base model to tune
min_samples_split=500,
min_samples_leaf=50,
max_depth=8,
max_features='sqrt',
subsample=0.8,
random_state=10),
param_grid = param_test1, # parameter grid
scoring='roc_auc',
n_jobs=4,
cv=5)
grid_search1.fit(X_train_res,y_train_res)
grid_search1.best_params_
Out[69]:
{'n_estimators': 90}
Modeling with the Searched Parameters
In [71]:
gb_clf_opt=GradientBoostingClassifier(n_estimators=90, # the value found by the search
learning_rate=1.0,
min_samples_split=500,
min_samples_leaf=50,
max_depth=8,
max_features='sqrt',
subsample=0.8,
random_state=10)
# refit
gb_clf_opt.fit(X_train_res,y_train_res)
y_gb_opt=gb_clf_opt.predict(X_test)
print('Gradient Boosting (Optimized)')
print(classification_report(y_test, y_gb_opt))
print(recall_score(y_test,y_gb_opt,pos_label=0))
_gbopt=plot_confusion_matrix(gb_clf_opt, X_test, y_test)
_gbopt
# Result
Gradient Boosting (Optimized)
precision recall f1-score support
0 0.85 0.84 0.85 541
1 0.97 0.97 0.97 2801
accuracy 0.95 3342
macro avg 0.91 0.91 0.91 3342
weighted avg 0.95 0.95 0.95 3342
0.8428835489833642
The top-left count improved from 454 to 456, a modest gain, though the effect is not dramatic.
Summary
Starting from a customer dataset, this article walked through the full churn early-warning workflow: data preprocessing, feature engineering and encoding, modeling, and hyperparameter tuning. The final models reach 95% accuracy, with recall on the attrited class at about 84.3%. There is certainly room for improvement; discussion is welcome!