Quick Start to Python Machine Learning (19)

9.4 Decision Tree Regression (DecisionTreeRegressor)

9.4.1 Class, Attributes, and Methods

Class

class sklearn.tree.DecisionTreeRegressor(*, criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, ccp_alpha=0.0)

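Parameters

Parameter	Type	Description
max_depth	int, default=None	The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
criterion	{'mse', 'friedman_mse', 'mae', 'poisson'}, default='mse'	The function used to measure the quality of a split. 'mse' is the mean squared error, which is equivalent to variance reduction as a feature selection criterion and minimizes the L2 loss using the mean of each terminal node. 'friedman_mse' uses the mean squared error with Friedman's improvement score for potential splits. 'mae' is the mean absolute error, which minimizes the L1 loss using the median of each terminal node. 'poisson' uses the reduction in Poisson deviance to find splits.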
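To make the 'mse' criterion concrete, here is a minimal pure-Python sketch of how a single split can be chosen by minimizing the total squared error of the two child nodes, which is the same as maximizing variance reduction. The helper names `sse` and `best_split` and the toy data are illustrative assumptions, not part of sklearn, and the real implementation is considerably more elaborate. Note also that the signature above matches older scikit-learn releases; from version 1.0 on, 'mse' and 'mae' are named 'squared_error' and 'absolute_error'.

```python
def sse(values):
    """Sum of squared errors around the mean (n times the variance)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Hypothetical helper: find the threshold on one feature that minimizes
    the summed squared error of the left and right child nodes."""
    pairs = sorted(zip(x, y))
    best_threshold, best_total = None, sse(list(y))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_total:
            best_threshold, best_total = threshold, total
    return best_threshold, best_total


# Toy data with two obvious clusters
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, total = best_split(x, y)
print(threshold, total)  # 6.5 4.0
```

Each leaf then predicts the mean of its training targets, which is why 'mse' is said to minimize the L2 loss with the node mean.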

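Attributes

Attribute	Description
feature_importances_	ndarray of shape (n_features,). The feature importances.
max_features_	int. The inferred value of max_features.
n_features_	int. The number of features when fit is performed.
n_outputs_	int. The number of outputs when fit is performed.
tree_	Tree instance. The underlying Tree object. See help(sklearn.tree._tree.Tree) for the attributes of the Tree object, and "Understanding the decision tree structure" in the sklearn documentation for basic usage of these attributes.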
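As a small illustration of these fitted attributes, the sketch below trains a shallow tree on synthetic data and inspects a few of them. The data set parameters are arbitrary assumptions chosen for the example. `n_features_` is left out on purpose because newer scikit-learn releases replaced it with `n_features_in_`.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 4 features, only 2 of which carry signal (illustrative values)
X, y = make_regression(n_samples=200, n_features=4, n_informative=2,
                       noise=10, random_state=0)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

importances = reg.feature_importances_          # one value per feature, sums to 1
print("feature importances:", importances)
print("max_features_:", reg.max_features_)      # inferred from max_features=None
print("n_outputs_:", reg.n_outputs_)
print("nodes in tree_:", reg.tree_.node_count)
print("depth of tree_:", reg.tree_.max_depth)
```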

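Methods

apply(X[, check_input])	Return the index of the leaf that each sample is predicted as.
cost_complexity_pruning_path(X, y[, …])	Compute the pruning path during minimal cost-complexity pruning.
decision_path(X[, check_input])	Return the decision path in the tree.
fit(X, y[, sample_weight, check_input, …])	Build a decision tree regressor from the training set (X, y).
get_depth()	Return the depth of the decision tree.
get_n_leaves()	Return the number of leaves of the decision tree.
get_params([deep])	Get the parameters of this estimator.
predict(X[, check_input])	Predict the class or regression value for X.
score(X, y[, sample_weight])	Return the coefficient of determination R² of the prediction.
set_params(**params)	Set the parameters of this estimator.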
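The following sketch exercises several of these methods on a small synthetic curve. The noisy sine data is an assumption made only for illustration. It also shows a property worth remembering: a regression tree predicts one constant per leaf, so the number of distinct predictions cannot exceed the number of leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy sine curve (illustrative data, not from the article)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

depth = reg.get_depth()              # actual depth, at most max_depth
n_leaves = reg.get_n_leaves()        # number of terminal nodes
leaf_ids = reg.apply(X)              # leaf index reached by each sample
preds = reg.predict(X)               # one constant value per leaf
r2 = reg.score(X, y)                 # coefficient of determination R^2

leaves_used = sorted(set(leaf_ids.tolist()))
print("depth:", depth, "leaves:", n_leaves)
print("leaf ids used:", leaves_used)
print("distinct predictions:", len(set(preds.tolist())))
print("R^2 on the training data:", round(r2, 3))
```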

9.4.2 Analyzing Noisy make_regression Data

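from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# util is the author's helper class (scores and plots); it is not defined in this section
def DecisionTreeRegressor_for_make_regression_add_noise():
    myutil = util()
    X, y = make_regression(n_samples=100, n_features=1, n_informative=2, noise=50, random_state=8)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    # NOTE: the model is fit on all of (X, y), so it has already seen the test samples
    clf = DecisionTreeRegressor().fit(X, y)
    title = "make_regression DecisionTreeRegressor() regression line (with noise)"
    myutil.print_scores(clf, X_train, y_train, X_test, y_test, title)
    myutil.draw_line(X[:, 0], y, clf, title)
    myutil.plot_learning_curve(DecisionTreeRegressor(), X, y, title)
    myutil.show_pic(title)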

Output

make_regression DecisionTreeRegressor() regression line (with noise):
100.00%
make_regression DecisionTreeRegressor() regression line (with noise):
100.00%

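Both scores are 100%, which looks excellent but is misleading. The model was fit on the full data set (X, y), so the test samples were already seen during training, and a tree with no depth limit can memorize every sample it is trained on. These numbers show how closely a tree can fit noisy data, not how well it generalizes.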
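For comparison, here is a minimal sketch that fits the tree on the training split only, so the test score reflects unseen data. It reuses the data settings from the function above but calls `score` directly instead of the author's `util` helper, and it sets `n_informative=1` and a fixed `random_state` on the tree as assumptions made for reproducibility. No output is quoted here because it was not part of the original article.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=1, n_informative=1,
                       noise=50, random_state=8)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=8, test_size=0.3)

# Fit on the training split only; the test split stays unseen
reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = reg.score(X_train, y_train)
test_r2 = reg.score(X_test, y_test)
print(f"train R^2: {train_r2:.2%}")
print(f"test  R^2: {test_r2:.2%}")
```

An unconstrained tree still reaches a perfect training score, while the test score drops, which is the overfitting pattern the following sections examine.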

9.4.3 Analyzing the Boston Housing Data

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def DecisionTreeRegressor_for_boston():
    myutil = util()
    # load_boston was removed in scikit-learn 1.2; this code needs an older release
    boston = datasets.load_boston()
    X, y = boston.data, boston.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)
    for max_depth in [1, 3, 5, 7]:
        clf = DecisionTreeRegressor(max_depth=max_depth)
        clf.fit(X_train, y_train)
        title = "Boston data (max_depth=" + str(max_depth) + ")"
        myutil.print_scores(clf, X_train, y_train, X_test, y_test, title)
        myutil.plot_learning_curve(DecisionTreeRegressor(max_depth=max_depth), X, y, title)
        myutil.show_pic(title)

Output

Boston data (max_depth=1):
45.95%
Boston data (max_depth=1):
35.44%
Boston data (max_depth=3):
83.84%
Boston data (max_depth=3):
62.87%
Boston data (max_depth=5):
93.82%
Boston data (max_depth=5):
69.38%
Boston data (max_depth=7):
97.31%
Boston data (max_depth=7):
79.19%

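Reading each pair as the training score followed by the test score (the argument order of print_scores), max_depth=7 gives the best test score (79.19%). The training score is well above the test score at every depth, however, so all four settings show overfitting.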

9.4.4 Analyzing the Diabetes Data

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def DecisionTreeRegressor_for_diabetes():
    myutil = util()
    diabetes = datasets.load_diabetes()
    X, y = diabetes.data, diabetes.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)
    for max_depth in [1, 3, 5, 7]:
        clf = DecisionTreeRegressor(max_depth=max_depth)
        clf.fit(X_train, y_train)
        title = "Diabetes data (max_depth=" + str(max_depth) + ")"
        myutil.print_scores(clf, X_train, y_train, X_test, y_test, title)
        myutil.plot_learning_curve(DecisionTreeRegressor(max_depth=max_depth), X, y, title)
        myutil.show_pic(title)

Output

Diabetes data (max_depth=1):
30.44%
Diabetes data (max_depth=1):
15.21%
Diabetes data (max_depth=3):
55.64%
Diabetes data (max_depth=3):
28.37%
Diabetes data (max_depth=5):
71.81%
Diabetes data (max_depth=5):
18.06%
Diabetes data (max_depth=7):
84.30%
Diabetes data (max_depth=7):
-1.26%

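Overfitting is severe here and gets worse as max_depth grows. At max_depth=7 the training score reaches 84.30% while the test score falls to -1.26%; a negative R² means the model does worse on the test set than simply predicting the mean of the targets.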
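The negative score is easier to interpret with the definition of R² in hand: R² = 1 - SS_res / SS_tot, where SS_res is the squared error of the predictions and SS_tot is the squared error of always predicting the mean. The sketch below computes it by hand on toy numbers; the function name `r_squared` and the data are illustrative assumptions.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])   # exact predictions
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
reversed_ = r_squared(y_true, [4.0, 3.0, 2.0, 1.0]) # worse than the mean

print(perfect, baseline, reversed_)  # 1.0 0.0 -3.0
```

A score of 0 corresponds to the mean predictor, so any model scoring below 0 on the test set has learned patterns that do not carry over to new data.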

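9.5 Pruning Decision Trees

Overfitting is the main weakness of decision trees, for classification and regression alike. Decision trees are nevertheless a useful and flexible method (Section 9.4.2 shows how closely a tree can fit noisy data), and there are two common ways to curb overfitting:

Pruning
Random forests

Random forests are a form of ensemble learning and are covered in the next chapter. This section looks at pruning.

Pre-pruning: stop the growth of the tree early. In sklearn this is done with parameters such as max_depth, min_samples_leaf, and max_leaf_nodes.
Post-pruning: build the full tree first, then prune it. sklearn supports this through minimal cost-complexity pruning (the ccp_alpha parameter and the cost_complexity_pruning_path method listed in 9.4.1).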

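from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def decision_tree_pruning():
    myutil = util()
    cancer = datasets.load_breast_cancer()
    # stratify keeps the class proportions the same in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, stratify=cancer.target, random_state=42)
    # Build the tree without pruning
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train, y_train)
    title = "Without pruning, accuracy"
    myutil.print_scores(tree, X_train, y_train, X_test, y_test, title)
    print("Without pruning, tree depth: {}".format(tree.get_depth()))
    # Build the tree with pre-pruning (depth limited to 4)
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(X_train, y_train)
    title = "With pruning, accuracy"
    myutil.print_scores(tree, X_train, y_train, X_test, y_test, title)
    print("With pruning, tree depth: {}".format(tree.get_depth()))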

Output

Without pruning, accuracy:
100.00%
Without pruning, accuracy:
93.71%
Without pruning, tree depth: 7
With pruning, accuracy:
98.83%
With pruning, accuracy:
95.10%
With pruning, tree depth: 4
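The example above limits max_depth, which is pre-pruning. As a complement, here is a hedged sketch of post-pruning with minimal cost-complexity pruning on the same data split. Picking the middle value of the pruning path is an arbitrary choice made only for illustration; in practice the value of ccp_alpha would be selected by validation. No scores are quoted because they were not part of the original article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Grow the full tree, then ask for the sequence of effective alphas
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Illustrative choice: the middle alpha of the path (clamped at 0 for safety)
alpha = max(0.0, float(ccp_alphas[len(ccp_alphas) // 2]))
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
pruned_tree.fit(X_train, y_train)

print("leaves before pruning:", full_tree.get_n_leaves())
print("leaves after pruning: ", pruned_tree.get_n_leaves())
print("test accuracy before:", full_tree.score(X_test, y_test))
print("test accuracy after: ", pruned_tree.score(X_test, y_test))
```

Larger values of ccp_alpha remove more of the tree, so the parameter plays a role similar to max_depth but is applied after the tree has been grown.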

9.6 Visualizing a Decision Tree

# pip3 install graphviz

# Graphviz is an open-source graph visualization tool from AT&T Research and Lucent Bell Labs

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

def show_tree():
    wine = datasets.load_wine()
    # Use only the first two features
    X = wine.data[:, :2]
    y = wine.target
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    # max_depth=3 keeps the picture small
    clf = DecisionTreeClassifier(max_depth=3)
    clf.fit(X_train, y_train)
    export_graphviz(clf, out_file="wine.dot", class_names=wine.target_names,
                    feature_names=wine.feature_names[:2], impurity=False, filled=True)
    # Read the dot file back
    with open("wine.dot") as f:
        dot_graph = f.read()
    # Return the Source object so that a notebook can render it
    return graphviz.Source(dot_graph)

To view the result, install the Graphviz software and open wine.dot with it.
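If installing Graphviz is not convenient, a plain-text view of the same kind of tree is available through `sklearn.tree.export_text`. The sketch below is an alternative to the approach above rather than part of it, and the fixed `random_state` is an assumption added for reproducibility.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
X = wine.data[:, :2]                      # first two features only
feature_names = list(wine.feature_names[:2])

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, wine.target)

# Render the tree as indented text, one line per node
text = export_text(clf, feature_names=feature_names)
print(text)
```

Each indented line shows a split condition or a leaf with its predicted class, which is often enough for a quick check of a small tree.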
