Using many Decision Trees – random forests

In this recipe, we'll use random forests for classification tasks. Random forests are used because they're very robust to overfitting and perform well in a variety of situations.

Getting ready

We'll explore this more in the How it works... section of this recipe, but random forests work by constructing a lot of very shallow trees and then taking a majority vote of the class that each tree predicts. This idea is very powerful in machine learning: if we recognize that a simple trained classifier might only be 60 percent accurate, we can train lots of classifiers that are generally right and then use the learners together.

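To see why this helps, here is a minimal sketch (not part of the recipe) that simulates 101 independent classifiers, each only 60 percent accurate, and measures the accuracy of their majority vote. It assumes, unrealistically, that the classifiers' errors are independent, which real ensembles only approximate:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classifiers, p_correct = 1000, 101, 0.6

# For each sample, mark whether each classifier got it right.
correct = rng.random((n_samples, n_classifiers)) < p_correct

# A sample is classified correctly when most of the classifiers are right;
# the majority vote scores far better than any single 60 percent classifier.
majority_correct = correct.sum(axis=1) > n_classifiers / 2
print("Majority vote accuracy:", majority_correct.mean())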

How to do it…

The mechanics of training a random forest classifier are very easy with scikit-learn. In this section, we'll do the following:

1. Create a sample dataset to practice with.

2. Train a basic random forest object.

3. Take a look at some of the attributes of a trained object.

In the next recipe, we'll look at how to tune the random forest classifier. Let's start by importing datasets:

from sklearn import datasets

Then, create the dataset with 1,000 samples:

X, y = datasets.make_classification(1000)
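
By default, make_classification generates 20 features and a binary target, which we can confirm quickly:

import numpy as np
print(X.shape)       # (1000, 20): 1,000 samples, 20 features
print(np.unique(y))  # [0 1]: a binary classification target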

Now that we have the data, we can create a classifier object and train it:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

The first thing we want to do is see how well we fit the training data. We can use the predict method for this check:

print("Accuracy:\t", (y == rf.predict(X)).mean())
Accuracy: 0.998
print("Total Correct:\t", (y == rf.predict(X)).sum())
Total Correct: 998
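
Keep in mind that this is accuracy on the very data the forest was trained on, so it is optimistic. Here is a quick sketch of the same check on a held-out split (the 80/20 split is an arbitrary choice, not from the recipe):

from sklearn.model_selection import train_test_split

# Hold out 20 percent of the samples that the forest never sees while training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
rf_holdout = RandomForestClassifier().fit(X_train, y_train)
print("Held-out accuracy:", (y_test == rf_holdout.predict(X_test)).mean())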

Now, let's look at some attributes and methods.

First, we'll look at some of the useful attributes; in this case, since we used defaults, they'll be the object defaults, as the snippet after this list confirms:

1. rf.criterion: This is the criterion for how the splits are determined. The default is gini.

2. rf.bootstrap: A Boolean that indicates whether we used bootstrap samples when training the random forest.

3. rf.n_jobs: The number of jobs to use for training and prediction. If you want to use all of your processors, set this to -1. Keep in mind that if your dataset isn't very big, using multiple jobs often adds overhead, because the data has to be serialized and moved between processes.

4. rf.max_features: This denotes the number of features to consider when making the best split. This will come in handy during the tuning process.

5. rf.compute_importances: This helps us decide whether to compute the importance of the features. See the There's more... section of this recipe for information on how to use this.

6. rf.max_depth: This denotes how deep each tree can go.

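Since we trained with the defaults, these attributes simply echo the constructor defaults, and we can print a few of them directly from the fitted object (the exact values depend on your scikit-learn version):

print(rf.criterion)     # 'gini'
print(rf.bootstrap)     # True
print(rf.n_jobs)        # None: train in a single process
print(rf.max_features)  # 'auto' in older releases, 'sqrt' in recent ones
print(rf.max_depth)     # None: grow each tree until its leaves are pure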

There are more attributes to note; check out the official documentation for more details. The predict method isn't the only useful one. We can also get the probabilities of each class for individual samples, which can be a useful way to understand the uncertainty in each prediction. For instance, we can predict the probabilities of each sample for the various classes:

probs = rf.predict_proba(X)
import pandas as pd
# Column '0' holds each sample's predicted probability of class 0.
probs_df = pd.DataFrame(probs, columns=['0', '1'])
probs_df['was_correct'] = rf.predict(X) == y
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
# Plot how often the forest is right at each predicted class-0 probability.
probs_df.groupby('0').was_correct.mean().plot(kind='bar', ax=ax)
ax.set_title("Accuracy at 0 class probability")
ax.set_ylabel("% Correct")
ax.set_xlabel("% trees for 0")

The following is the output, a bar chart of accuracy at each predicted class 0 probability:

How it works…

Random forest works by using a predetermined number of weak Decision Trees and by training each one of these trees on a subset of the data. This is critical in avoiding overfitting; it is also the reason for the bootstrap parameter.
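
To make the bootstrap idea concrete, here is a minimal sketch of how one tree's training subset can be drawn (an illustration of the idea, not scikit-learn's internal code):

import numpy as np

rng = np.random.default_rng(0)
n = len(X)

# Draw n row indices with replacement; on average only about 63 percent
# of the unique rows appear in any one tree's bootstrap sample.
idx = rng.integers(0, n, size=n)
X_boot, y_boot = X[idx], y[idx]
print("Fraction of unique rows:", len(np.unique(idx)) / n)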

With each tree trained this way, the forest's final prediction is the following:

1. The class with the most votes, if we use classification trees

2. The averaged output, if we use regression trees

There are, of course, performance considerations, which we'll cover in the next recipe, but for the purposes of understanding how random forests work, we train a bunch of average trees and get a fairly good classifier as a result.

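In fact, we can reproduce the forest's vote by hand: scikit-learn exposes the fitted trees through rf.estimators_, and the forest's predict_proba is documented as the mean of the per-tree class probabilities, so this check should print True:

import numpy as np

# Average the class probabilities of every fitted tree in the forest.
tree_probs = np.mean([tree.predict_proba(X) for tree in rf.estimators_], axis=0)
print(np.allclose(tree_probs, rf.predict_proba(X)))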

There's more…

Feature importance is a good by-product of random forests. This often helps to answer the question: if we have 10 features, which features are most important in determining the true class of the data point? The real-world applications are hopefully easy to see. For example, if a transaction is fraudulent, we probably want to know whether there are certain signals that can be used to figure out a transaction's class more quickly.

If we want to calculate the feature importance, we used to need to state it when we create the object. If you use scikit-learn 0.15, you might get a warning that it is not required; in version 0.16, the warning will be removed. In current versions, the compute_importances parameter has been removed entirely, and feature_importances_ is always available on a fitted forest:

rf = RandomForestClassifier()
rf.fit(X, y)
f, ax = plt.subplots(figsize=(7, 5))
# feature_importances_ holds one score per feature; higher means more useful for splits.
ax.bar(range(len(rf.feature_importances_)), rf.feature_importances_)
ax.set_title("Feature Importances")

The following is the output, a bar chart of the importance score of each feature:

As we can see, certain features are much more important than others when determining if the outcome was of class 0 or class 1.

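To identify exactly which features those are, we can pair the importances with their indices and sort them; a small convenience sketch, not part of the original recipe:

import pandas as pd

# Rank the features by how much they contributed to the forest's splits.
importances = pd.Series(rf.feature_importances_)
print(importances.sort_values(ascending=False).head())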
