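Quick Start to Python Machine Learning (20)

10 Ensemble Learning

10.1 Random Forest Algorithm

10.1.1 Concepts

In 2001 Breiman combined classification trees into a random forest (Breiman 2001a): the use of variables (columns) and of data (rows) is randomized to grow many classification trees, and the results of those trees are then aggregated. Random forests improve prediction accuracy without a significant increase in computation.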

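Algorithm flow:

Inputs: the number of decision trees t, the number of features per tree f, and a data set with m samples and n features.

1 Training a single decision tree

1.1 Draw m samples from the original data set with replacement, giving a data set of m samples (which may contain duplicates)

1.2 Select f of the n features by sampling without replacement and use them as the input features

1.3 Build a decision tree on the new data set (m samples, f features)

1.4 Repeat the steps above t times to build t decision trees

2 Prediction of the random forest

Once the t decision trees are built, each new test sample is predicted by combining the outputs of all the trees.

Regression: the forest's prediction is the mean of the t trees' predicted values.

Classification: majority vote; the class predicted by the most trees is the forest's prediction.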
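The steps above can be sketched in a few lines. This is only a minimal illustration of the flow as described here, not how sklearn implements it: the helper names fit_simple_forest and predict_simple_forest are made up for this example, and sklearn's RandomForestClassifier samples candidate features at every split rather than once per tree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def fit_simple_forest(X, y, t=10, f=2, seed=0):
    """Train t trees, each on a bootstrap sample and f randomly chosen features."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    forest = []
    for _ in range(t):                               # step 1.4: repeat t times
        rows = rng.integers(0, m, size=m)            # step 1.1: m draws with replacement
        cols = rng.choice(n, size=f, replace=False)  # step 1.2: f features without replacement
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[rows][:, cols], y[rows])          # step 1.3: fit on the resampled data
        forest.append((tree, cols))
    return forest


def predict_simple_forest(forest, X):
    """Classify by majority vote (labels are assumed to be non-negative integers)."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])  # (t, n_samples)
    return np.array([np.bincount(sample_votes).argmax() for sample_votes in votes.T])


X, y = load_iris(return_X_y=True)
forest = fit_simple_forest(X, y, t=15, f=2)
pred = predict_simple_forest(forest, X)
print("training accuracy: {:.2%}".format((pred == y).mean()))
```

For regression the last step would average the trees' outputs instead of counting votes.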

In sklearn, RandomForestClassifier and RandomForestRegressor implement random forests for classification and regression, respectively.

10.1.2 Random Forest Classification

Class parameters, attributes, and methods

Class


class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

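Parameters

Parameter	Type	Description
n_estimators	int, default=100	The number of trees in the forest.
random_state	int, RandomState instance or None, default=None	Controls both the randomness of the bootstrap sampling used when building trees (if bootstrap=True) and the sampling of the features considered when looking for the best split at each node (if max_features < n_features).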

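Attributes

Attribute	Type	Description
base_estimator_	DecisionTreeClassifier	The child estimator template used to create the collection of fitted sub-estimators.
estimators_	list of DecisionTreeClassifier	The collection of fitted sub-estimators.
classes_	ndarray of shape (n_classes,) or a list of such arrays	The class labels (single-output problem), or a list of arrays of class labels (multi-output problem).
n_classes_	int or list	The number of classes (single-output problem), or a list containing the number of classes for each output (multi-output problem).
n_features_	int	The number of features when fit is performed.
n_outputs_	int	The number of outputs when fit is performed.
feature_importances_	ndarray of shape (n_features,)	The impurity-based feature importances.
oob_score_	float	Score of the training data set obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.
oob_decision_function_	ndarray of shape (n_samples, n_classes)	Decision function computed with the out-of-bag estimate on the training set. If n_estimators is small, a data point may never have been left out during the bootstrap; in that case oob_decision_function_ may contain NaN. This attribute exists only when oob_score is True.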
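A short sketch of how some of these attributes can be read after fitting. The iris data set and the parameter values are picked only for illustration; oob_score=True is required for the two out-of-bag attributes to exist.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
# oob_score=True asks the forest to score each sample with the trees that never saw it
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(iris.data, iris.target)

print("classes_:", clf.classes_)
print("number of fitted trees:", len(clf.estimators_))
print("oob_score_: {:.3f}".format(clf.oob_score_))
# The impurity-based importances are normalised so that they sum to 1
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print("{:<20s}{:.3f}".format(name, importance))
```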

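Methods

Method	Description
apply(X)	Apply the trees in the forest to X and return the leaf indices.
decision_path(X)	Return the decision path in the forest.
fit(X, y[, sample_weight])	Build a forest of trees from the training set (X, y).
get_params([deep])	Get the parameters of this estimator.
predict(X)	Predict the class of X.
predict_log_proba(X)	Predict the class log-probabilities of X.
predict_proba(X)	Predict the class probabilities of X.
score(X, y[, sample_weight])	Return the mean accuracy on the given test data and labels.
set_params(**params)	Set the parameters of this estimator.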
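To make the return values of these methods concrete, here is a small sketch on the iris data set (chosen only as an example; the split and the parameter values are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

labels = clf.predict(X_test)          # one class label per sample
proba = clf.predict_proba(X_test)     # one row per sample, one column per class
leaves = clf.apply(X_test)            # index of the leaf reached in each tree
accuracy = clf.score(X_test, y_test)  # mean accuracy of predict(X_test) against y_test

print("labels:", labels.shape)
print("proba:", proba.shape)
print("leaves:", leaves.shape)
print("accuracy: {:.2%}".format(accuracy))
```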

Random forest classification parameters


import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# util is the author's own helper class (print_scores, show_pic); it is not defined in this section


def base_of_decision_tree_forest(n_estimator, random_state, X, y, title):
    myutil = util()
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    # n_jobs: number of CPU cores used in parallel
    clf = RandomForestClassifier(n_estimators=n_estimator, random_state=random_state, n_jobs=2)
    # Fit on the training set
    clf.fit(X_train, y_train)
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
    # Use the two features as the horizontal and vertical axes
    x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
    y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
                         np.arange(y_min, y_max, .02))
    # Predict the class of every grid point and colour the regions accordingly
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')
    # Draw the samples as scatter points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=20, edgecolors='k')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    title = title + " data, random forest score (n_estimators:" + str(n_estimator) + ", random_state:" + str(random_state) + ")"
    myutil.print_scores(clf, X_train, y_train, X_test, y_test, title)


def tree_forest():
    myutil = util()
    title = ["Iris", "Wine", "Breast cancer"]
    j = 0
    for datas in [datasets.load_iris(), datasets.load_wine(), datasets.load_breast_cancer()]:
        # One 4x4 grid of panels per data set
        figure, axes = plt.subplots(4, 4, figsize=(100, 10))
        plt.subplots_adjust(hspace=0.95)
        i = 0
        # Use only the first two features
        X = datas.data[:, :2]
        y = datas.target
        mytitle = title[j]
        for n_estimator in range(4, 8):
            for random_state in range(2, 6):
                plt.subplot(4, 4, i + 1)
                plt.title("n_estimator:" + str(n_estimator) + " random_state:" + str(random_state))
                plt.suptitle("Random forest classification")
                base_of_decision_tree_forest(n_estimator, random_state, X, y, mytitle)
                i = i + 1
        myutil.show_pic(mytitle)
        j = j + 1

Iris

	n_estimators=4		n_estimators=5		n_estimators=6		n_estimators=7	
random_state	Train	Test	Train	Test	Train	Test	Train	Test
2	89.29%	71.05%	91.07%	71.05%	91.07%	78.95%	91.96%	76.32%
3	91.07%	68.42%	89.29%	76.32%	94.64%	60.53%	91.07%	76.32%
4	91.07%	71.05%	93.75%	78.95%	93.75%	68.42%	91.07%	71.05%
5	90.18%	68.42%	92.86%	78.95%	91.96%	60.53%	92.86%	76.32%

Wine

	n_estimators=4		n_estimators=5		n_estimators=6		n_estimators=7	
random_state	Train	Test	Train	Test	Train	Test	Train	Test
2	96.24%	80.00%	98.50%	68.89%	96.99%	82.22%	97.74%	82.22%
3	92.48%	86.67%	97.74%	80.00%	96.24%	73.33%	98.50%	68.89%
4	95.49%	75.56%	93.98%	84.44%	96.99%	84.44%	98.50%	91.11%
5	96.24%	68.89%	96.99%	82.22%	96.99%	71.11%	98.50%	75.56%

Breast cancer

	n_estimators=4		n_estimators=5		n_estimators=6		n_estimators=7	
random_state	Train	Test	Train	Test	Train	Test	Train	Test
2	97.18%	81.82%	97.65%	90.91%	98.83%	89.51%	99.30%	85.31%
3	97.89%	85.31%	97.65%	80.42%	97.89%	86.71%	98.59%	84.62%
4	97.42%	82.52%	97.89%	88.11%	97.42%	89.51%	97.89%	90.91%
5	97.18%	87.41%	98.83%	88.11%	98.12%	88.81%	98.83%	88.81%


import matplotlib.pyplot as plt
import mglearn
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def my_RandomForest():
    # Generate a two-dimensional toy data set
    X, y = datasets.make_moons(n_samples=100, noise=0.25, random_state=3)
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
    # Random forest classifier with 5 decision trees
    forest = RandomForestClassifier(n_estimators=5, random_state=2)
    # Fit on the training set
    forest.fit(X_train, y_train)
    # Plot the decision boundary of each tree
    fig, axes = plt.subplots(2, 3, figsize=(20, 10))
    for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):
        ax.set_title('Tree {}'.format(i))
        mglearn.plots.plot_tree_partition(X_train, y_train, tree, ax=ax)
        print("Decision tree " + str(i) + " training score: {:.2%}".format(tree.score(X_train, y_train)))
        print("Decision tree " + str(i) + " test score: {:.2%}".format(tree.score(X_test, y_test)))
    # Plot the decision boundary of the whole forest in the last panel
    print("Random forest training score: {:.2%}".format(forest.score(X_train, y_train)))
    print("Random forest test score: {:.2%}".format(forest.score(X_test, y_test)))
    mglearn.plots.plot_2d_separator(forest, X_train, fill=True, ax=axes[-1, -1], alpha=0.4)
    axes[-1, -1].set_title('Random Forest')
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)
    plt.show()

Output

Decision tree 0 training score: 89.33%
Decision tree 0 test score: 84.00%
Decision tree 1 training score: 96.00%
Decision tree 1 test score: 88.00%
Decision tree 2 training score: 97.33%
Decision tree 2 test score: 80.00%
Decision tree 3 training score: 89.33%
Decision tree 3 test score: 92.00%
Decision tree 4 training score: 92.00%
Decision tree 4 test score: 88.00%
Random forest training score: 96.00%
Random forest test score: 92.00%

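Decision tree 3 shows no overfitting, and decision tree 4 has the same train-test gap as the random forest, yet the random forest matches or beats both of them on the training and the test score.

	Training set	Test set	Gap (percentage points, rounded)
Random forest	96.00%	92.00%	4
Decision tree 0	89.33%	84.00%	5
Decision tree 1	96.00%	88.00%	8
Decision tree 2	97.33%	80.00%	17
Decision tree 3	89.33%	92.00%	-3
Decision tree 4	92.00%	88.00%	4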
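One detail is worth checking in code: to my knowledge, sklearn's RandomForestClassifier does not count hard votes but averages the class probabilities predicted by its trees and then picks the most probable class. The sketch below rebuilds the same forest as in the example above (same data and parameters) and compares a manual average with the forest's own output; the variable names are only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
forest = RandomForestClassifier(n_estimators=5, random_state=2).fit(X_train, y_train)

# Average the per-tree class probabilities by hand
tree_proba = np.mean([tree.predict_proba(X_test) for tree in forest.estimators_], axis=0)
# Pick the most probable class and map the column index back to a class label
manual_pred = forest.classes_[tree_proba.argmax(axis=1)]

same_proba = np.allclose(tree_proba, forest.predict_proba(X_test))
same_labels = bool((manual_pred == forest.predict(X_test)).all())
print("same probabilities:", same_proba)
print("same labels:", same_labels)
```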

Random forest classification: a worked example

From http://archive.ics.uci.edu/ml/machine-learning-databases/adult/, download the adult.data file and rename it to adult.csv


import pandas as pd


def income_forecast():
    # adult.csv has no header row, so the column names are supplied here
    data = pd.read_csv('adult.csv', header=None, index_col=False,
                       names=['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                              'marital_status', 'occupation', 'relationship', 'race', 'sex',
                              'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
                              'income'])
    # Keep only part of the columns so the data is easier to display
    data_title = data[['age', 'workclass', 'education', 'sex', 'hours_per_week', 'occupation', 'income']]
    print(data_title.head())
    # Size of the data set, read from the shape attribute
    print("data_title.shape:\n", data_title.shape)
    data_title.info()

The workclass, education, sex, occupation, and income columns are not numeric (dtype object).

Output

   age         workclass  education  ...  hours_per_week         occupation income
0   39         State-gov  Bachelors  ...              40       Adm-clerical  <=50K
1   50  Self-emp-not-inc  Bachelors  ...              13    Exec-managerial  <=50K
2   38           Private    HS-grad  ...              40  Handlers-cleaners  <=50K
3   53           Private       11th  ...              40  Handlers-cleaners  <=50K
4   28           Private  Bachelors  ...              40     Prof-specialty  <=50K
data_title.shape:
 (32561, 7)
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  object
 2   education       32561 non-null  object
 3   sex             32561 non-null  object
 4   hours_per_week  32561 non-null  int64
 5   occupation      32561 non-null  object
 6   income          32561 non-null  object
dtypes: int64(2), object(5)
memory usage: 1.7 MB


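import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Continues from income_forecast(): data_title is the 7-column DataFrame built there
## 1 - Data preparation
# 1.2 Preprocessing
# Convert text columns to numeric dummy variables with get_dummies (dtype=int gives 0/1 columns)
data_dummies = pd.get_dummies(data_title, dtype=int)
print("data_dummies.shape:\n", data_dummies.shape)
# Compare the original features with the dummy-variable features (df.columns returns the header)
print('Original features:\n', list(data_title.columns), '\n')
print('Dummy-variable features:\n', list(data_dummies.columns))
# x and y are not defined in the original listing. The reconstruction below is an assumption
# chosen to match the shapes in the output: x is every column except the dummy columns of the
# label (the last column of data_title), and y is the last of those dummy columns (income >50K)
label_cols = [c for c in data_dummies.columns if c.startswith(data_title.columns[-1] + '_')]
x = data_dummies.drop(columns=label_cols).values
y = data_dummies[label_cols[-1]].values
print('Feature shape:{} Label shape:{}'.format(x.shape, y.shape))
## 2 - Modelling: split the data set / train / test
# 2.1 Split into training and test sets with train_test_split(); by default 75% of the
# data goes to the training set and 25% to the test set
# Both x and y are split; random_state=0 fixes the seed so the split is the same on every run
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
# Sizes of the resulting sets
print('x_train_shape:{}'.format(x_train.shape))
print('x_test_shape:{}'.format(x_test.shape))
print('y_train_shape:{}'.format(y_train.shape))
print('y_test_shape:{}'.format(y_test.shape))
# 2.2 Training with a decision tree: estimator.fit(x_train, y_train)
tree = DecisionTreeClassifier(max_depth=5)  # maximum depth set to 5
tree.fit(x_train, y_train)
# 2.3 Testing: estimator.score(x_test, y_test)
score_test = tree.score(x_test, y_test)
score_train = tree.score(x_train, y_train)
print('test_score:{:.2%}'.format(score_test))
print('train_score:{:.2%}'.format(score_train))
## 3 - Applying the model: estimator.predict(x_new)
# The new data point can be made up or taken from the original data set, but it must have
# the same structure (the same 44 columns in the same order) as the training data
x_new = [[37,40,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]]
# Intended profile: age 37, government employer, master's degree, male, 40 hours per week, clerical work
prediction = tree.predict(x_new)
print('New data:{}'.format(x_new))
print('Prediction:{}'.format(prediction))

Output

Feature shape:(32561, 44) Label shape:(32561,)
x_train_shape:(24420, 44)
x_test_shape:(8141, 44)
y_train_shape:(24420,)
y_test_shape:(8141,)
test_score:79.62%
train_score:80.34%
New data:[[37, 40, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
Prediction:[0]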
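Writing the 44 values of x_new by hand is error-prone, because every 0 and 1 has to sit in exactly the position of its dummy column. One possible alternative, sketched here on a made-up three-row table rather than on the real adult data, is to encode the new record with get_dummies as well and align it to the training columns with reindex. The toy column names and values are only for illustration.

```python
import pandas as pd

# Hypothetical toy training data standing in for the feature columns of data_title
train = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Female", "Male"],
    "occupation": ["Adm-clerical", "Sales", "Adm-clerical"],
})
train_dummies = pd.get_dummies(train, dtype=int)

# Encode the new record the same way, then align it to the training columns;
# dummy columns that the new record does not produce are filled with 0
new_person = pd.DataFrame({"age": [37], "sex": ["Male"], "occupation": ["Adm-clerical"]})
x_new = (pd.get_dummies(new_person, dtype=int)
           .reindex(columns=train_dummies.columns, fill_value=0))

print(list(train_dummies.columns))
print(x_new.values.tolist())
```

The aligned x_new can then be passed to predict in place of the hand-written list, provided the model was trained on the columns of train_dummies in the same order.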
