Using many Decision Trees – random forests

2020-04-29 11:39:20

In this recipe, we'll use random forests for classification tasks. Random forests are popular because they're very robust to overfitting and perform well in a variety of situations.


Getting ready

We'll explore this more in the How it works... section of this recipe, but random forests work by constructing a large number of individually weak trees and then taking a vote over the class that each tree predicts. This idea is very powerful in machine learning. If we recognize that a single simple classifier might be only 60 percent accurate, we can train many such classifiers, each right more often than not, and then combine them into a much stronger learner.
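The voting argument can be made concrete with a little arithmetic. The following is a minimal sketch, assuming every classifier is right with the same probability and that their errors are independent (real trees in a forest are correlated, so the gain in practice is smaller). The function name majority_vote_accuracy is made up for this illustration:

```python
from math import comb


def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that a strict majority of independent voters is right."""
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * p_correct ** k
        * (1 - p_correct) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


# Odd ensemble sizes avoid ties in a two-class vote.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these idealized assumptions, the accuracy of the vote rises as more 60 percent classifiers are added, which is the intuition behind using many trees together.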

How to do it…

Training a random forest classifier is very easy with scikit-learn. In this section, we'll do the following:


1. Create a sample dataset to practice with.

2. Train a basic random forest object.

3. Take a look at some of the attributes of a trained object.




In the next recipe, we'll look at how to tune the random forest classifier. Let's start by importing datasets:


from sklearn import datasets

Then, create the dataset with 1,000 samples:

X, y = datasets.make_classification(1000)
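To see what this call returns, here is a small sketch that spells out the sample count by keyword and fixes random_state so the data is reproducible. The random_state value is an arbitrary choice for this example and is not part of the original recipe; with the other defaults left alone, make_classification produces 20 features and 2 classes:

```python
import numpy as np
from sklearn.datasets import make_classification

# random_state=0 is an arbitrary seed chosen for this illustration.
X, y = make_classification(n_samples=1000, random_state=0)

print(X.shape)       # feature matrix: one row per sample
print(y.shape)       # one label per sample
print(np.unique(y))  # the distinct class labels
```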

Now that we have the data, we can create a classifier object and train it:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

The first thing we want to do is see how well we fit the training data. We can use the predict method to get these predictions:


print("Accuracy:\t", (y == rf.predict(X)).mean())
Accuracy: 0.998
print("Total Correct:\t", (y == rf.predict(X)).sum())
Total Correct: 998
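Accuracy on the training data tends to be optimistic, because the forest has already seen every sample. As a hedged sketch that goes slightly beyond the recipe, one option is to hold out part of the data and compare the two scores. The 25 percent split and the seeds below are arbitrary choices made for this example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Keep 25% of the samples aside; the forest never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# score() returns mean accuracy, the same quantity computed by hand above.
train_acc = rf.score(X_train, y_train)
test_acc = rf.score(X_test, y_test)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

The held-out score is usually the more honest estimate of how the classifier behaves on new data.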

Now, let's look at some attributes and methods.

First, we'll look at some of the useful attributes; in this case, since we used defaults, they'll be the object defaults:


1. rf.criterion : This is the criterion used to determine the splits. The default is gini .

2. rf.bootstrap : A Boolean that indicates whether bootstrap samples were used when training the random forest.

3. rf.n_jobs : The number of jobs used to train and predict. If you want to use all the processors, set this to -1 . Keep in mind that if your dataset isn't very big, using multiple jobs often adds more overhead than it saves, because the data has to be serialized and moved between processes.

4. rf.max_features : This denotes the number of features to consider when looking for the best split. This will come in handy during the tuning process.

5. rf.compute_importances : In scikit-learn 0.15 and earlier, this decided whether to compute the importance of the features. The parameter has since been removed; the importances are now read from rf.feature_importances_ after fitting. See the There's more... section of this recipe for information on how to use this.

6. rf.max_depth : This denotes how deep each tree can go.
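The attributes listed above can be read straight off a fitted object. This is a minimal sketch, assuming a reasonably recent scikit-learn; the printed defaults differ between versions (for example, max_features and the number of trees), so no expected output is shown. The estimators_ attribute, which holds the fitted trees, is an addition to the list above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

# Constructor parameters are stored as attributes with the same names.
for name in ("criterion", "bootstrap", "n_jobs", "max_features", "max_depth"):
    print(name, "=", getattr(rf, name))

# The individual fitted trees are kept in rf.estimators_.
print("number of fitted trees:", len(rf.estimators_))
```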







There are more attributes to note; check out the official documentation for more details.

The predict method isn't the only useful one. We can also get the probability of each class for individual samples. This is useful for understanding the uncertainty in each prediction. For instance, we can predict the probabilities of each sample for the various classes:


import matplotlib.pyplot as plt
import pandas as pd

# Column '0' is the predicted probability of class 0, column '1' of class 1.
probs = rf.predict_proba(X)
probs_df = pd.DataFrame(probs, columns=['0', '1'])
probs_df['was_correct'] = rf.predict(X) == y

# Share of correct predictions at each class-0 probability level.
f, ax = plt.subplots(figsize=(7, 5))
probs_df.groupby('0').was_correct.mean().plot(kind='bar', ax=ax)
ax.set_title("Accuracy at 0 class probability")
ax.set_ylabel("% Correct")
ax.set_xlabel("% trees for 0")

The following is the output: [Figure: bar chart titled "Accuracy at 0 class probability"; x-axis "% trees for 0", y-axis "% Correct"]
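To make the link between predict_proba and predict concrete, the following sketch checks two properties on a freshly fitted forest: each row of probabilities sums to 1, and the predicted label is the class with the highest probability. The variable name predicted_from_probs is made up for this illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

probs = rf.predict_proba(X)  # shape: (n_samples, n_classes)

# Each row is a probability distribution over the classes.
rows_sum_to_one = np.allclose(probs.sum(axis=1), 1.0)

# Picking the most probable class per row reproduces predict().
predicted_from_probs = rf.classes_[np.argmax(probs, axis=1)]
matches_predict = np.array_equal(predicted_from_probs, rf.predict(X))

print(rows_sum_to_one, matches_predict)
```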

How it works…

Random forest works by using a predetermined number of weak Decision Trees and by training each one of these trees on a subset of the data. This is critical in avoiding overfitting. This is also the reason for the bootstrap parameter. Once each tree is trained, the forest combines the trees' predictions into one of the following:


1. The class with the most votes, for classification

2. The average of the outputs, if we use regression trees

There are, of course, performance considerations, which we'll cover in the next recipe, but for the purposes of understanding how random forests work, we train a bunch of mediocre trees and get a fairly good classifier as a result.
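The procedure described above can be sketched by hand with plain decision trees. This is a simplified illustration, not scikit-learn's actual implementation: RandomForestClassifier averages the trees' class probabilities rather than counting hard votes. The number of trees, the seeds, and the variable names are arbitrary choices for this example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
rng = np.random.default_rng(0)

n_trees = 25  # odd, so a two-class vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw n rows with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random feature subset.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# One row of predictions per tree: shape (n_trees, n_samples).
votes = np.stack([tree.predict(X) for tree in trees])

# Labels are 0 and 1, so a mean vote above 0.5 means class 1 wins.
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = (votes[0] == y).mean()
forest_acc = (forest_pred == y).mean()
print("First tree alone:", single_acc)
print("Majority vote:   ", forest_acc)
```

Each tree only sees its own bootstrap sample, so any single tree makes mistakes on the full data that the combined vote can correct.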



There's more…

Feature importance is a good by-product of random forests. It often helps to answer the question: if we have 10 features, which features are most important in determining the true class of the data point? The real-world applications are hopefully easy to see. For example, if a transaction is fraudulent, we probably want to know if there are certain signals that can be used to figure out a transaction's class more quickly.


In older versions of scikit-learn, if we wanted to calculate the feature importance, we had to state it when creating the object. If you use scikit-learn 0.15, you might get a warning that this is not required; from version 0.16 onward, the warning is gone and the compute_importances=True parameter no longer exists, so we simply fit the forest and read rf.feature_importances_:

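rf = RandomForestClassifier()
rf.fit(X, y)
f, ax = plt.subplots(figsize=(7, 5))
# One bar per feature, with height equal to its importance.
ax.bar(range(len(rf.feature_importances_)), rf.feature_importances_)
ax.set_title("Feature Importances")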

The following is the output: [Figure: bar chart titled "Feature Importances", one bar per feature]

As we can see, certain features are much more important than others when determining if the outcome was of class 0 or class 1.
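Besides plotting, the importances can be ranked numerically. The following is a small sketch, assuming the same kind of synthetic dataset as above; the seeds and the choice to show the top five features are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

importances = rf.feature_importances_  # one value per feature

# Feature indices sorted from most to least important.
ranking = np.argsort(importances)[::-1]

for feature_index in ranking[:5]:
    print("feature", feature_index, "importance", round(importances[feature_index], 3))
```

The importances are normalized to sum to 1, so each value can be read as that feature's share of the total.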

