Kaggle in Practice: Bank Customer Churn Prediction


This post walks through the data analysis and modeling behind a Kaggle credit-card customer churn early-warning project.

Background

In recent years, customer churn has been a problem for traditional and internet businesses alike. Banks, telephone-service companies, internet companies, insurers, and similar businesses routinely use churn analysis and the churn rate as one of their key business metrics.

In general, retaining an existing customer costs far less than acquiring a new one. That is why these companies run customer-service teams dedicated to winning back customers who are about to leave: existing customers are simply more valuable to the business than new ones.

Remember one thing: customer acquisition is expensive, so retention matters.

Importing Libraries

In [1]:

import numpy as np
import pandas as pd

import plotly as py
import plotly_express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score as f1
from sklearn.metrics import confusion_matrix
import scikitplot as skplt

In [2]:

df = pd.read_csv("BankChurners.csv")
df.head()

Basic Information About the Data

df.shape

# output
(10127, 23)

So the dataset has 10,127 rows and 23 columns.

In [3]:

# all columns
columns = df.columns
columns

Out[3]:

Index(['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
       'Dependent_count', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'],
      dtype='object')

The columns are described as follows:

  • CLIENTNUM:Client number - Unique identifier for the customer holding the account
  • Attrition_Flag:Flag indicative of account closure in next 6 months (between Jan to Jun 2013)
  • Customer_Age:Age of the account holder
  • Gender:Gender of the account holder
  • Dependent_count:Number of people financially dependent on the account holder
  • Education_Level:Educational qualification of account holder (ex - high school, college grad etc.)
  • Marital_Status:Marital status of account holder (Single, Married, Divorced, Unknown)
  • Income_Category:Annual income category of the account holder
  • Card_Category:Card type depicting the variants of the cards by value proposition (Blue, Silver and Platinum)
  • Months_on_book:Number of months since the account holder opened an account with the lender
  • Total_Relationship_Count:Total number of products held by the customer. Total number of relationships the account holder has with the bank (example - retail bank, mortgage, wealth management etc.)
  • Months_Inactive_12_mon:Total number of months inactive in last 12 months
  • Contacts_Count_12_mon:Number of contacts in the last 12 months. Number of times the account holder called the call center in the past 12 months
  • Credit_Limit:Credit limit
  • Total_Revolving_Bal:Total amount as revolving balance
  • Avg_Open_To_Buy:Open to Buy Credit Line (Average of last 12 months)
  • Total_Amt_Chng_Q4_Q1:Change in Transaction Amount (Q4 over Q1)
  • Total_Trans_Amt:Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct:Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1:Change in Transaction Count (Q4 over Q1)
  • Avg_Utilization_Ratio:Average Card Utilization Ratio
  • Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1
  • Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2

In [4]:

df.dtypes   # dtype of each column (output truncated in the original post)

The following counts how many columns there are of each dtype:

# count the columns of each dtype

df.dtypes.value_counts()

int64      10
float64     7
object      6
dtype: int64
df.describe().style.background_gradient(cmap="ocean_r")  # styled table output

Styled descriptive statistics of df (some columns shown).

Missing Values

In [7]:

# missing-value count for each column
df.isnull().sum()

# missing-value ratio
total = df.isnull().sum().sort_values(ascending=False)
Percentage = total / len(df)

Sorted in descending order, the first value is 0, which confirms the dataset has no missing values at all.
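To view the counts and ratios side by side, the two Series can be combined into one frame; a small sketch, not in the original notebook:

# combine count and ratio into a single summary table
missing = pd.concat([total, Percentage], axis=1, keys=["Total", "Percent"])
missing.head()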

Dropping Irrelevant Columns

In [9]:

no_use = np.arange(21, df.shape[1])  # indices of the last two columns
no_use

Out[9]:

array([21, 22])

In [10]:

# 1. drop multiple columns
df.drop(df.columns[no_use], axis=1, inplace=True)

In [11]:

CLIENTNUM is just the customer ID; it is useless for modeling, so drop it directly:

# 2. drop a single column
df.drop("CLIENTNUM", axis=1, inplace=True)

Columns of the new df (after dropping the useless ones):

In [12]:

df.columns

Out[12]:

Index(['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count',
       'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
       'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
      dtype='object')

In [13]:

Check the descriptive statistics again:

df.describe().style.background_gradient(cmap="ocean_r")

EDA: Exploratory Data Analysis

Usage Frequency and Numeric Features

In [14]:

Pull out the customer-related numeric columns:

# df_frequency = df[["Customer_Age","Total_Trans_Ct","Total_Trans_Amt","Months_Inactive_12_mon","Credit_Limit","Attrition_Flag"]]  # equivalent to the concat below

df_frequency = pd.concat([df['Customer_Age'],
                        df['Total_Trans_Ct'],
                        df['Total_Trans_Amt'],
                        df['Months_Inactive_12_mon'],
                        df['Credit_Limit'],
                        df['Attrition_Flag']],
                       axis=1)

df_frequency.head()

Explore how pairs of these columns relate across the two Attrition_Flag values:

In [15]:

df["Attrition_Flag"].value_counts()

Out[15]:

Existing Customer    8500  # existing customers
Attrited Customer    1627  # churned customers
Name: Attrition_Flag, dtype: int64

So there are 8,500 existing customers and 1,627 churned customers.
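As a quick sanity check (an addition to the original flow), the overall churn rate is just the share of attrited customers:

churn_rate = (df["Attrition_Flag"] == "Attrited Customer").mean()
print(f"Churn rate: {churn_rate:.1%}")  # 1627 / 10127, about 16.1%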

In [16]:

# set up a 1x4 figure

fig, ax = plt.subplots(ncols=4, figsize=(20,6))

sns.scatterplot(data=df_frequency,
                x="Total_Trans_Amt",
                y="Total_Trans_Ct",
                hue="Attrition_Flag",
                ax=ax[0])

sns.scatterplot(data=df_frequency,
                x="Months_Inactive_12_mon",
                y="Total_Trans_Ct",
                hue="Attrition_Flag",
                ax=ax[1])

sns.scatterplot(data=df_frequency,
                x="Credit_Limit",
                y="Total_Trans_Ct",
                hue="Attrition_Flag",
                ax=ax[2])

sns.scatterplot(data=df_frequency,
              x="Customer_Age",
              y="Total_Trans_Ct",
              hue="Attrition_Flag",
              ax=ax[3])

plt.show()

The same plots, implemented with plotly:

for col in ["Customer_Age","Total_Trans_Amt","Months_Inactive_12_mon","Credit_Limit"]:
    fig = px.scatter(df_frequency,
                     x=col,
                     y="Total_Trans_Ct",
                     color="Attrition_Flag")
    fig.show()

The loop above plots each column against Total_Trans_Ct one at a time. Below is the same idea implemented with go.Scatter:

# work on a copy

df_frequency_copy = df_frequency.copy()
df_frequency_copy["Attrition_Flag_number"] = df_frequency_copy["Attrition_Flag"].apply(lambda x: 1 if x == "Existing Customer" else 2)

# two basic layout parameters: rows and columns

four_columns = ["Total_Trans_Amt","Months_Inactive_12_mon","Credit_Limit","Customer_Age"]

fig = make_subplots(rows=1,
                    cols=4,
                    start_cell="top-left",
                    shared_yaxes=True,
                    subplot_titles=four_columns  # subplot titles
                   )

for i, v in enumerate(four_columns):
    r = i // 4 + 1   # row index
    c = (i + 1) % 4  # column index (remainder; 0 means the last column)

    if c == 0:
        fig.add_trace(go.Scatter(x=df_frequency_copy[v].tolist(),
                             y=df_frequency_copy["Total_Trans_Ct"].tolist(),
                             mode='markers',
                             marker=dict(color=df_frequency_copy.Attrition_Flag_number)),
                 row=r, col=4)

    else:
        fig.add_trace(go.Scatter(x=df_frequency_copy[v].tolist(),
                             y=df_frequency_copy["Total_Trans_Ct"].tolist(),
                             mode='markers',
                             marker=dict(color=df_frequency_copy.Attrition_Flag_number)),
                 row=r, col=c)

fig.update_layout(width=1000, height=450, showlegend=False)

fig.show()

Blue: existing customers; yellow: churned customers.

A few takeaways:

  1. Panel 1: the more a customer spends per year, the more likely they are to stay (not churn)
  2. Customers who have been inactive for 2-3 months are noticeably more likely to churn
  3. The higher the credit limit, the more likely the customer stays
  4. Panel 3 also shows that most churned customers made fewer than 100 card transactions
  5. Panel 4 suggests that age is not an important factor

Customer Demographics

Demographic information here mainly covers age, gender, education level, marital status (single, married, etc.), and income level.

In [21]:

Pull out the relevant columns for analysis:

df_demographic=df[['Customer_Age',
                   'Gender',
                   'Education_Level',
                   'Marital_Status',
                   'Income_Category',
                   'Attrition_Flag']]

df_demographic.head()

Age Distribution by Customer Type

In [22]:

px.violin(df_demographic,
          y="Customer_Age",
          color="Attrition_Flag")

The violin plot shows that the two customer types have very similar age distributions.

Conclusion: age is not a key factor in whether a customer churns.

Age Distribution

Look at the age distribution across the whole dataset:

fig = make_subplots(rows=2, cols=1)

trace1=go.Box(x=df['Customer_Age'],name='Age With Box Plot',boxmean=True)
trace2=go.Histogram(x=df['Customer_Age'],name='Age With Histogram')

fig.add_trace(trace1, row=1,col=1)
fig.add_trace(trace2, row=2,col=1)

fig.update_layout(height=500, width=1000, title_text="Customer Age Distribution")
fig.show()

Age is roughly normally distributed, with most customers between 40 and 55.

Customer Counts by Type and Gender

In [23]:

flag_gender = df.groupby(["Attrition_Flag","Gender"]).size().reset_index().rename(columns={0:"number"})
flag_gender

Out[23]:

   Attrition_Flag     Gender  number
0  Attrited Customer  F          930
1  Attrited Customer  M          697
2  Existing Customer  F         4428
3  Existing Customer  M         4072

In [24]:

fig = px.bar(flag_gender,
             x="Attrition_Flag",
             y="number",
             color="Gender",
             barmode="group",
             text="number")

fig.show()

From the grouped bar chart:

  1. Women outnumber men in this dataset, both overall and within each customer type; see the sketch below for churn rates by gender
  2. The data are imbalanced: existing vs. churned customers is roughly 8,500 : 1,627
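To check whether gender shifts the churn rate rather than just the raw counts, the crosstab can be normalized by row; a small sketch, not part of the original post:

# share of each Attrition_Flag value within each gender
pd.crosstab(df["Gender"], df["Attrition_Flag"], normalize="index")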

Crosstab Analysis

The statistics below use pandas crosstabs. A good article explaining crosstab: https://pbpython.com/pandas-crosstab.html

In [25]:

fig, (ax1,ax2,ax3,ax4) = plt.subplots(ncols=4, figsize=(20,5))

pd.crosstab(df["Attrition_Flag"],df["Gender"]).plot(kind="bar", ax=ax1, ylim=[0,5000])
pd.crosstab(df["Attrition_Flag"],df["Education_Level"]).plot(kind="bar", ax=ax2, ylim=[0,5000])
pd.crosstab(df["Attrition_Flag"],df["Marital_Status"]).plot(kind="bar", ax=ax3, ylim=[0,5000])
pd.crosstab(df["Attrition_Flag"],df["Income_Category"]).plot(kind="bar", ax=ax4, ylim=[0,5000])


fig, (ax1,ax2,ax3) = plt.subplots(ncols=3, figsize=(20,5))
pd.crosstab(df['Attrition_Flag'],df['Dependent_count']).plot(kind='bar',ax=ax1, ylim=[0,5000])
pd.crosstab(df['Attrition_Flag'],df['Card_Category']).plot(kind='bar',ax=ax2, ylim=[0,10000])

_box = sns.boxplot(data=df_demographic,x='Attrition_Flag',y='Customer_Age', ax=ax3)

plt.show()

We can observe that education level and marital status are distributed similarly across the two customer types, and the age boxplot again confirms the earlier conclusion that age does not separate existing from churned customers.

Education Level

fig = px.pie(df, names='Education_Level', title='Proportion Of Education Levels')
fig.show()

Comparing the Two Customer Counts

In [26]:

churn = df["Attrition_Flag"].value_counts()
churn

Out[26]:

Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64

In [27]:

churn.keys()

Out[27]:

Index(['Existing Customer', 'Attrited Customer'], dtype='object')

In [28]:

plt.pie(x=churn, labels=churn.keys(),autopct="%.1f%%")

plt.show()

The pie chart shows:

  • Existing customers still make up the vast majority
  • Later we will use resampling to balance the two customer classes

Correlation

The columns include both categorical and numeric types, which call for different analysis and encoding approaches:

  • Numeric variables: Pearson correlation coefficient
  • Categorical variables: Cramér's V, a common measure of association between two categorical variables

Reference: https://blog.csdn.net/deecheanW/article/details/120474864
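For reference, Cramér's V for an r x c contingency table with n observations is

    V = sqrt( chi2 / ( n * min(r - 1, c - 1) ) )

where chi2 is the chi-squared statistic of the table; the helper function defined below computes exactly this quantity.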

# categorical columns
# equivalent: df.select_dtypes(include="O")
df_categorical = df.loc[:, df.dtypes == object]
df_categorical.head()

# numeric columns
df_number = df.select_dtypes(exclude="O")
df_number.head()

One-hot encode the Attrition_Flag column:

In [31]:

# keep the original label, then one-hot encode it; the later correlation heatmaps
# reference the resulting "Existing Customer" / "Attrited Customer" columns
df_number["Attrition_Flag"] = df.loc[:, "Attrition_Flag"]
df_number = pd.concat([df_number.drop("Attrition_Flag", axis=1),
                       pd.get_dummies(df_number["Attrition_Flag"])], axis=1)

Label Encoding

In [34]:

from sklearn import preprocessing

label = preprocessing.LabelEncoder()
df_categorical_encoded = pd.DataFrame()

# label-encode each categorical column (fit_transform refits the encoder per column)
for i in df_categorical.columns:
    df_categorical_encoded[i] = label.fit_transform(df_categorical[i])

Computing Cramér's V

In [35]:

from scipy.stats import chi2_contingency

# helper to compute Cramér's V between two categorical series
def cal_cramers_v(v1, v2):
    crosstab = np.array(pd.crosstab(v1, v2, rownames=None, colnames=None))
    stat = chi2_contingency(crosstab)[0]  # chi-squared statistic

    obs = np.sum(crosstab)          # number of observations n
    mini = min(crosstab.shape) - 1  # min(r-1, c-1)

    # the original post returned stat / (obs * mini), i.e. V squared;
    # Cramér's V is the square root of that ratio
    return np.sqrt(stat / (obs * mini))

In [36]:

rows = []
for v1 in df_categorical_encoded:
    col = []
    for v2 in df_categorical_encoded:
        # Cramér's V for this pair of columns
        cramers = cal_cramers_v(df_categorical_encoded[v1], df_categorical_encoded[v2])
        col.append(round(cramers, 2))
    rows.append(col)

In [37]:

# heatmap of the Cramér's V matrix

cramers_results = np.array(rows)

cramerv_matrix = pd.DataFrame(cramers_results,
                              columns=df_categorical_encoded.columns,
                              index=df_categorical_encoded.columns)
cramerv_matrix.head()

Plot the corresponding heatmap:

mask = np.triu(np.ones_like(cramerv_matrix, dtype=bool))
cat_heatmap = sns.heatmap(cramerv_matrix, # 系数矩阵
                          mask=mask,
                          vmin=-1,
                          vmax=1,
                          annot=True,
                          cmap="BrBG")

cat_heatmap.set_title("Heatmap of Correlation(Categorical)", fontdict={"fontsize": 14}, pad=12)

plt.show()

# Pearson correlation of the numeric columns

from scipy import stats

num_corr = df_number.corr()  # correlation matrix of the numeric columns
plt.figure(figsize = (16,6))

mask = np.triu(np.ones_like(num_corr, dtype=bool))
heatmap_number = sns.heatmap(num_corr, mask=mask,
                             vmin=-1, vmax=1,
                             annot=True, cmap="RdYlBu")

heatmap_number.set_title("Heatmap of Correlation(Number)", fontdict={"fontsize": 14}, pad=12)

plt.show()

# correlations of each numeric column with the two customer-type indicator columns
fig, ax = plt.subplots(ncols=2, figsize=(15,6))

heatmap = sns.heatmap(num_corr[["Existing Customer"]].sort_values(by="Existing Customer", ascending=False),
                     ax=ax[0],
                     vmin=-1,
                     vmax=1,
                     annot=True,
                     cmap="coolwarm_r")
heatmap.set_title("Features Correlating with Existing Customers",fontdict={"fontsize":18}, pad=16);

heatmap = sns.heatmap(num_corr[["Attrited Customer"]].sort_values(by="Attrited Customer", ascending=False),
                     ax=ax[1],
                     vmin=-1,
                     vmax=1,
                     annot=True,
                     cmap="coolwarm_r")
heatmap.set_title("Features Correlating with Attrited Customers",fontdict={"fontsize":18}, pad=16);

fig.tight_layout(pad=5)

plt.show()

Summary: the right-hand heatmap shows that the following columns are essentially uncorrelated with the churned customer type, with correlation coefficients between -0.1 and 0.1 (right panel):

  • Credit Limit
  • Average Open To Buy
  • Months On Book
  • Age
  • Dependent Count

We now drop these columns:

In [41]:

df_model = df.copy()

df_model = df_model.drop(['Credit_Limit','Customer_Age','Avg_Open_To_Buy','Months_on_book','Dependent_count'],axis=1)

Encoding the Target Label

In [42]:

df_model['Attrition_Flag'] = df_model['Attrition_Flag'].map({'Existing Customer': 1, 'Attrited Customer': 0})

One-hot encode the remaining categorical columns:

df_model=pd.get_dummies(df_model)
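As a toy illustration of what get_dummies does (the mini-frame here is hypothetical, not from the dataset):

demo = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Blue"]})
pd.get_dummies(demo)  # yields one indicator column per category:
                      # Card_Category_Blue and Card_Category_Silver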

Modeling

Splitting the Data

We showed earlier that existing and churned customers are imbalanced, so we balance the training data with SMOTE (Synthetic Minority Oversampling Technique), which oversamples the minority class by synthesizing new examples.

In [50]:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

In [51]:

# features and target

# X = df_model.drop("Attrition_Flag", axis=1)  # equivalent; note the original
# commented line used inplace=True, which would return None
X = df_model.loc[:, df_model.columns != "Attrition_Flag"]
y = df_model["Attrition_Flag"]

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

SMOTE Resampling

In [52]:

sm = SMOTE(sampling_strategy="minority", k_neighbors=20, random_state=42)

# perform the resampling
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
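To confirm the resampling balanced the classes, the label counts can be compared before and after; a quick check, not in the original notebook:

from collections import Counter

print(Counter(y_train))      # imbalanced before SMOTE
print(Counter(y_train_res))  # both classes equal in size after SMOTE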

Three Models

In [53]:

# 1. random forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train_res, y_train_res)

Out[53]:

RandomForestClassifier()

Tree-based models generally do not require feature scaling, but support vector machines do:

In [54]:

# 2. support vector machine

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# SVM needs standardized features

svm = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svm.fit(X_train_res, y_train_res)

Out[54]:

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(gamma='auto'))])

In [55]:

# 3. gradient boosting

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100,  # number of trees
                                learning_rate=1.0,  # learning rate
                                max_depth=1,   # maximum depth of each tree
                                random_state=42)

gb.fit(X_train_res, y_train_res)

Out[55]:

GradientBoostingClassifier(learning_rate=1.0, max_depth=1, random_state=42)

Model Predictions

In [56]:

y_rf = rf.predict(X_test)
y_svm = svm.predict(X_test)
y_gb = gb.predict(X_test)
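Before looking at confusion matrices, a quick numeric comparison is possible with the f1 alias imported at the top, scoring the churned class (label 0); a small addition to the original flow:

# F1 score on the churned class for each model
for name, pred in [("RF", y_rf), ("SVM", y_svm), ("GB", y_gb)]:
    print(name, round(f1(y_test, pred, pos_label=0), 3))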

Confusion Matrices

In [57]:

from sklearn.metrics import plot_confusion_matrix
# note: plot_confusion_matrix was removed in scikit-learn 1.2;
# on newer versions use ConfusionMatrixDisplay.from_estimator instead

fig,ax=plt.subplots(ncols=3, figsize=(20,6))

plot_confusion_matrix(rf, X_test, y_test, ax=ax[0])
ax[0].title.set_text('RF')

plot_confusion_matrix(svm, X_test, y_test, ax=ax[1])
ax[1].title.set_text('SVM')

plot_confusion_matrix(gb, X_test, y_test, ax=ax[2])
ax[2].title.set_text('GB')
fig.tight_layout(pad=5)

plt.show()

Classification Reports

In [58]:

# classification_report, recall_score, precision_score, f1_score

from sklearn.metrics import classification_report, recall_score, precision_score, f1_score

print('Random Forest Classifier')
print(classification_report(y_test, y_rf))

print('------------------------')
print('Support Vector Machine')
print(classification_report(y_test, y_svm))

print('------------------------')
print('Gradient Boosting')
print(classification_report(y_test, y_gb))
# output
Random Forest Classifier
              precision    recall  f1-score   support

           0       0.85      0.83      0.84       541
           1       0.97      0.97      0.97      2801

    accuracy                           0.95      3342
   macro avg       0.91      0.90      0.90      3342
weighted avg       0.95      0.95      0.95      3342

------------------------
Support Vector Machine
              precision    recall  f1-score   support

           0       0.81      0.55      0.66       541
           1       0.92      0.98      0.95      2801

    accuracy                           0.91      3342
   macro avg       0.87      0.76      0.80      3342
weighted avg       0.90      0.91      0.90      3342

------------------------
Gradient Boosting
              precision    recall  f1-score   support

           0       0.83      0.84      0.84       541
           1       0.97      0.97      0.97      2801

    accuracy                           0.95      3342
   macro avg       0.90      0.90      0.90      3342
weighted avg       0.95      0.95      0.95      3342

Judging by the confusion matrices and classification metrics, random forest and gradient boosting both clearly outperform the support vector machine.

Hyperparameter Tuning

We tune the random forest and gradient boosting models with two different search strategies:

  • Random forest: randomized search
  • Gradient boosting: grid search

Randomized Search: Random Forest

In [59]:

from sklearn.model_selection import RandomizedSearchCV

Define the candidate values for each hyperparameter:

In [60]:

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# n_estimators  # number of trees in the forest

max_features = ['auto', 'sqrt']

In [61]:

# maximum depth of each tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

max_depth

Out[61]:

[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None]

In [62]:

min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

Randomized Search Parameters

In [64]:

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

The search results:

In [65]:

rf_random = RandomizedSearchCV(
    estimator=rf,  # the random forest model
    param_distributions=random_grid,  # search space
    n_iter=30,
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1)

rf_random.fit(X_train_res, y_train_res)
print(rf_random.best_params_)

# output
Fitting 3 folds for each of 30 candidates, totalling 90 fits
{'n_estimators': 1400,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 110,
 'bootstrap': True}
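Incidentally, RandomizedSearchCV refits the best configuration on the full training data by default (refit=True), so the tuned model can also be reused directly instead of retyping the parameters:

best_rf = rf_random.best_estimator_  # already refit on X_train_res, y_train_res
y_rf_opt = best_rf.predict(X_test)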
Refitting with the Searched Parameters

Build the model again with the parameters found by the search:

In [67]:

rf_clf_search= RandomForestClassifier(n_estimators=1400,
                                   min_samples_split=2,
                                   min_samples_leaf=1,
                                   max_features='auto',
                                   max_depth=110,
                                   bootstrap=True)

rf_clf_search.fit(X_train_res,y_train_res)
y_rf_opt=rf_clf_search.predict(X_test)

print('Random Forest Classifier (Optimized)')

print(classification_report(y_test, y_rf_opt))

_rf_opt=plot_confusion_matrix(rf_clf_search, X_test, y_test)

# output
Random Forest Classifier (Optimized)
              precision    recall  f1-score   support

           0       0.86      0.84      0.85       541
           1       0.97      0.97      0.97      2801

    accuracy                           0.95      3342
   macro avg       0.91      0.90      0.91      3342
weighted avg       0.95      0.95      0.95      3342

In the tuned model's confusion matrix, the top-left count rose from 449 to 452, i.e., a few more churned customers are classified correctly.

Grid Search: Gradient Boosting

Grid Search Parameters

In [68]:

from sklearn.model_selection import GridSearchCV

param_test1 = {'n_estimators':range(20,100,10)}
param_test1

Out[68]:

{'n_estimators': range(20, 100, 10)}

In [69]:

# run the grid search

grid_search1 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=1.0,  # base model to tune
                                                               min_samples_split=500,
                                                               min_samples_leaf=50,
                                                               max_depth=8,
                                                               max_features='sqrt',
                                                               subsample=0.8,
                                                               random_state=10),
                        param_grid=param_test1,  # the parameter grid
                        scoring='roc_auc',
                        n_jobs=4,
                        cv=5)

grid_search1.fit(X_train_res,y_train_res)

grid_search1.best_params_

Out[69]:

{'n_estimators': 90}

Refitting with the Searched Parameters

In [71]:

gb_clf_opt=GradientBoostingClassifier(n_estimators=90,  # value found by the grid search
                                      learning_rate=1.0,
                                      min_samples_split=500,
                                      min_samples_leaf=50,
                                      max_depth=8,
                                      max_features='sqrt',
                                      subsample=0.8,
                                      random_state=10)
# refit with the tuned parameter
gb_clf_opt.fit(X_train_res,y_train_res)

y_gb_opt=gb_clf_opt.predict(X_test)
print('Gradient Boosting (Optimized)')
print(classification_report(y_test, y_gb_opt))

print(recall_score(y_test,y_gb_opt,pos_label=0))
_gbopt=plot_confusion_matrix(gb_clf_opt, X_test, y_test)
_gbopt

# output
Gradient Boosting (Optimized)
              precision    recall  f1-score   support

           0       0.85      0.84      0.85       541
           1       0.97      0.97      0.97      2801

    accuracy                           0.95      3342
   macro avg       0.91      0.91      0.91      3342
weighted avg       0.95      0.95      0.95      3342

0.8428835489833642

The top-left count improved from 454 to 456, a small gain, though not a dramatic one.

Summary

Starting from a customer dataset, this post walked through the full churn early-warning workflow: data preprocessing, feature engineering and encoding, modeling, and hyperparameter tuning. The final models reach 95% accuracy with about 84.2% recall on the churned class. There is certainly room for improvement; discussion is welcome!
