In this recipe, we will perform basic classification using Decision Trees. These are very nice models because they are easily understandable, and once trained, scoring is very simple.
Often, the fitted tree can be expressed as SQL statements, which means that the outcome can be used by a lot of people.
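For instance, a trained tree can be printed as a set of threshold rules that map naturally onto SQL-style CASE/WHEN logic. The following is a minimal sketch (not part of the original recipe) that assumes a scikit-learn version providing sklearn.tree.export_text; the variable names X_demo, y_demo, and the feature names are hypothetical:
# A small sketch: export a fitted tree as human-readable rules.
# Assumes sklearn.tree.export_text is available (scikit-learn 0.21+).
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

X_demo, y_demo = datasets.make_classification(n_samples=100, n_features=3, n_redundant=0)
tree = DecisionTreeClassifier(max_depth=2).fit(X_demo, y_demo)

# Each root-to-leaf path is a chain of simple threshold tests,
# which is why it translates readily into SQL-style conditions.
print(export_text(tree, feature_names=["f0", "f1", "f2"]))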
Getting ready
In this recipe, we'll look at Decision Trees. I like to think of Decision Trees as the base class from which a large number of other classification methods are derived. It's a pretty simple idea that works well in a bunch of situations.
First, let's get some classification data that we can practice on:
from sklearn import datasets
X, y = datasets.make_classification(n_samples=1000, n_features=3, n_redundant=0)
How to do it…
Working with Decision Trees is easy. We first need to import the object, and then fit the model:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X, y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
preds = dt.predict(X)
(y == preds).mean()
1.0
As you can see, we predicted the training data perfectly. Clearly, this was just a dry run; now let's investigate some of our options.
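Since we scored on the very data we trained on, perfect accuracy isn't surprising. As a quick sanity check (not part of the original recipe), we can hold out a portion of the data; this sketch assumes sklearn.model_selection.train_test_split is available, and dt_holdout is just a hypothetical name:
# A minimal sketch of scoring on held-out data instead of the training set.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
dt_holdout = DecisionTreeClassifier().fit(X_train, y_train)
print((dt_holdout.predict(X_test) == y_test).mean())  # typically a bit below 1.0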
First, if you look at the dt object, it has several keyword arguments that determine how the object will behave. How we choose these arguments is important, so we'll look at their effects in detail.
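One quick way to see those keyword arguments and their current values is the standard scikit-learn estimator API; this small sketch simply prints them:
# Print the keyword arguments (hyperparameters) of the fitted tree.
for name, value in sorted(dt.get_params().items()):
    print(name, "=", value)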
The first detail we'll look at is max_depth. This is an important parameter. It determines how many branches are allowed. This is important because a Decision Tree can have a hard time generalizing to out-of-sample data without some sort of regularization. Later, we'll see how we can use several shallow Decision Trees to make a better learner. Let's create a more complex dataset and see what happens when we allow different values of max_depth. We'll use this dataset for the rest of the recipe:
n_features = 200
X, y = datasets.make_classification(750, n_features, n_informative=5)
import numpy as np
training = np.random.choice([True, False], p=[.75, .25], size=len(y))
accuracies = []
for x in np.arange(1, n_features + 1):
    dt = DecisionTreeClassifier(max_depth=x)
    dt.fit(X[training], y[training])
    preds = dt.predict(X[~training])
    accuracies.append((preds == y[~training]).mean())
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
ax.plot(range(1, n_features + 1), accuracies, color='k')
ax.set_title("Decision Tree Accuracy")
ax.set_ylabel("% Correct")
ax.set_xlabel("Max Depth")
plt.show()
The following is the output:
We can see that we actually get pretty accurate results at a low max depth. Let's take a closer look at the accuracy at low levels, say the first 15:
N = 15
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
ax.plot(range(1, n_features + 1)[:N], accuracies[:N], color='k')
ax.set_title("Decision Tree Accuracy")
ax.set_ylabel("% Correct")
ax.set_xlabel("Max Depth")
plt.show()
The following is the output:
There's the spike we saw earlier; it's quite amazing to see the quick drop though. It's more likely that a max depth of 1 through 3 is fairly equivalent. Decision Trees are quite good at separating rules, but they need to be reined in.
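One common way to rein a tree in is to pick max_depth by cross-validation rather than by a single random split. The following is a minimal sketch (not part of the original recipe) using GridSearchCV on the dataset generated above:
# A minimal sketch of tuning max_depth with 5-fold cross-validation.
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(DecisionTreeClassifier(),
                    param_grid={"max_depth": list(range(1, 16))},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)  # the depth with the best cross-validated accuracy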
We'll look at the compute_importances parameter here. It actually has a bit of a broader meaning for random forests, but we'll get acquainted with it. It's also worth noting that if you're using version 0.16 or earlier, you will get this for free:
# plot the importances
ne0 = dt.feature_importances_ != 0
y_comp = dt.feature_importances_[ne0]
x_comp = np.arange(len(dt.feature_importances_))[ne0]
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
ax.bar(x_comp, y_comp)
The following is the output:
【Please note that you may get an error letting you know that you no longer need to explicitly set compute_importances.】
As we can see, one of the features is by far the most important; several other features follow it.
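If you want the actual indices, a short sketch like the following (assuming the dt fitted above) sorts the features by importance and prints the top few:
# List the five most important feature indices in descending order.
import numpy as np

top = np.argsort(dt.feature_importances_)[::-1][:5]
for idx in top:
    print(idx, dt.feature_importances_[idx])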
How it works…
In the simplest sense, we construct Decision Trees all the time. When thinking through situations and assigning probabilities to outcomes, we construct Decision Trees. Our rules are much more complex and involve a lot of context, but with Decision Trees, all we care about is the difference between outcomes, given that some information is already known about a feature.
Now, let's discuss the differences between entropy and Gini impurity.
Entropy is more than just the entropy value at any given variable; it states what the change in entropy is if we know an element's value. This is called Information Gain (IG); mathematically, it looks like the following:
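The original figure with the formula is not reproduced here; the standard definition of information gain for splitting a dataset $S$ on a feature $A$ (which is what the text is describing) is:

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v), \qquad H(S) = -\sum_{i} p_i \log_2 p_i$$

Here $S_v$ is the subset of $S$ where $A$ takes the value $v$, and $p_i$ is the proportion of class $i$ in $S$.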
For Gini impurity, we care about how likely one of the data points would be mislabeled given the new information. Both entropy and Gini impurity have pros and cons; this said, if you see major differences in how entropy and Gini impurity behave, it will probably be a good idea to re-examine your assumptions.
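As a quick check (reusing the X, y, and training mask from the max_depth experiment above, and not part of the original recipe), we can fit one tree with each criterion and compare held-out accuracy; dt_c is just a hypothetical name:
# A small sketch comparing the 'gini' and 'entropy' split criteria.
for criterion in ("gini", "entropy"):
    dt_c = DecisionTreeClassifier(criterion=criterion, max_depth=3)
    dt_c.fit(X[training], y[training])
    print(criterion, (dt_c.predict(X[~training]) == y[~training]).mean())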