If we use just the basic implementation of a Decision Tree, it will probably not fit very well. Therefore, we need to tweak the parameters in order to get a good fit. This is very easy and won't require much effort.
Getting ready
In this recipe, we will take an in-depth look at what it takes to tune a Decision Tree classifier. There are several options, and in the previous recipe we only looked at one of them. We'll fit a basic model and actually look at what the Decision Tree looks like. Then, we'll re-examine it after each decision and point out how various changes have influenced the structure. If you want to follow along with this recipe, you'll need to install pydot.
How to do it…
Decision Trees have a lot more "knobs" than most other algorithms, which makes it easier to see what happens when we turn them:
from sklearn import datasets
X, y = datasets.make_classification(1000, 20, n_informative=3)
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X, y)
Ok, so now that we have a basic classifier fit, we can view it quite simply:
from sklearn import tree
import graphviz
graph = graphviz.Source(tree.export_graphviz(dt))
graph.render('myfile', format='png', view=True)  # view expects a boolean; the tree is saved as myfile.png
'''
# The StringIO-based code from the original book no longer works as-is on Python 3
# (StringIO moved to the io module, and newer pydot returns a list from graph_from_dot_data):
from StringIO import StringIO
from sklearn import tree
import pydot
str_buffer = StringIO()
tree.export_graphviz(dt, out_file=str_buffer)
graph = pydot.graph_from_dot_data(str_buffer.getvalue())
graph.write("myfile.jpg")
'''
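If you still want to go down the pydot route from the original book, a rough Python 3 equivalent might look like the following sketch. It assumes a reasonably recent pydot (1.2 or later), where graph_from_dot_data returns a list of graphs rather than a single graph:

from io import StringIO  # Python 3 home of StringIO
from sklearn import tree
import pydot

str_buffer = StringIO()
tree.export_graphviz(dt, out_file=str_buffer)
# Recent pydot versions return a list of graphs here
(graph,) = pydot.graph_from_dot_data(str_buffer.getvalue())
graph.write_png("myfile.png")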
The graph is almost certainly illegible, but hopefully this illustrates the complex trees that can be generated as a result of using an unoptimized Decision Tree:
Wow! This is a very complex tree. It will most likely overfit the data. First, let's reduce the max depth value:
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X, y)
As an aside, if you're wondering why you see the repr of the model after calling fit (and why you might add a semicolon to suppress it in a notebook): fit actually returns the fitted Decision Tree object itself, which is also what allows chaining:
dt = DecisionTreeClassifier(max_depth=5).fit(X, y)
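While we're at it, it can be reassuring to check that the constraint actually took effect. The following is a small sketch that assumes a reasonably recent scikit-learn, where fitted trees expose get_depth and get_n_leaves:

print(dt.get_depth())       # should now be at most 5
print(dt.get_n_leaves())    # far fewer leaves than the unconstrained tree
print(dt.tree_.node_count)  # total number of nodes in the underlying tree structure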
Now, let's get back to the regularly scheduled program. As we will plot this a few times, it's worth wrapping the plotting code in a small helper function (a sketch follows the next snippet):
graph = graphviz.Source(tree.export_graphviz(dt))
graph.render('myfile', format='png', view=True)
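Since we'll repeat these two lines for every model we fit, one way to wrap them up is the small helper below. The name plot_dt is just an illustrative choice, not anything defined by scikit-learn or graphviz:

def plot_dt(model, filename):
    # Export the fitted tree to DOT format and render it as a PNG
    graph = graphviz.Source(tree.export_graphviz(model))
    graph.render(filename, format='png', view=True)

plot_dt(dt, 'myfile')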
The following is the graph that will be generated:
This is a much simpler tree. Let's look at what happens when we use entropy as the splitting criterion:
dt = DecisionTreeClassifier(criterion='entropy', max_depth=5).fit(X, y)
graph = graphviz.Source(tree.export_graphviz(dt))
graph.render('entropy', format='png', view=True)
The following is the graph that can be generated:
It's good to see that the first two splits are on the same features, and the first few after them are interspersed with similar amounts of information. This is a good sanity check. Also, note how the entropy for the first split is 0.999, whereas when using the Gini impurity the value for the first split is 0.5. This has to do with how differently the two measures evaluate a split of a Decision Tree; see the following How it works... section for more information. However, if we want to create a Decision Tree with entropy, we must use the following command:
dt = DecisionTreeClassifier(min_samples_leaf=10, criterion='entropy', max_depth=5).fit(X, y)
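To see where the 0.999 versus 0.5 figures at the root come from, you can evaluate both impurity measures by hand for a roughly balanced node. The snippet below is only an illustration of the two formulas, not part of the recipe's pipeline:

import numpy as np

p = np.array([0.5, 0.5])            # class proportions at an (almost) balanced root node
entropy = -np.sum(p * np.log2(p))   # -sum(p * log2(p)) -> 1.0 for a 50/50 split
gini = 1 - np.sum(p ** 2)           # 1 - sum(p^2)      -> 0.5 for a 50/50 split
print(entropy, gini)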
How it works…
Decision Trees, in general, suffer from overfitting. Quite often, left to its own devices, a Decision Tree model will overfit, and therefore we need to think about how best to avoid it; this is done by avoiding complexity. A simple model will more often than not work better in practice. We're about to see this very idea in action; random forests will build on this idea of simple models.
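To make the overfitting point concrete, one option is to compare an unconstrained tree with a depth-limited one on a held-out set. This is a minimal sketch using a standard train/test split; the exact scores will vary from run to run:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

deep = DecisionTreeClassifier().fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)

# The unconstrained tree typically scores ~1.0 on the training data but lower on the test data;
# the depth-limited tree tends to generalize a little better.
print(deep.score(X_train, y_train), deep.score(X_test, y_test))
print(shallow.score(X_train, y_train), shallow.score(X_test, y_test))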