Using dummy estimators to compare results

2020-05-07 14:12:44

This recipe is about creating fake estimators; this isn't the pretty or exciting stuff, but it is worthwhile to have a reference point for the model you'll eventually build.


Getting ready

In this recipe, we'll perform the following tasks:

1. Create some random data.

2. Fit the various dummy estimators.

We'll perform these two steps for regression data and for classification data.

How to do it...

First, we'll create the random data:

from sklearn.datasets import make_regression, make_classification
from sklearn import dummy
# classification is for later
X, y = make_regression()
dumdum = dummy.DummyRegressor()
dumdum.fit(X, y)
DummyRegressor(constant=None, quantile=None, strategy='mean')

By default, the estimator ignores the features and predicts the mean of the training targets for every sample:

dumdum.predict(X)[:5]
array([10.39941733, 10.39941733, 10.39941733, 10.39941733, 10.39941733])
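As a quick sanity check of this default behaviour, the following minimal sketch compares the dummy predictions with the mean of y computed directly. The fixed random_state and the name baseline are assumptions added for reproducibility and are not part of the original recipe.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor

# random_state is fixed only so that the sketch is reproducible (assumption)
X, y = make_regression(random_state=0)

baseline = DummyRegressor(strategy="mean").fit(X, y)
predictions = baseline.predict(X)

# Every prediction equals the mean of the training targets, whatever X contains
print(np.allclose(predictions, y.mean()))
```

The features play no role here: the estimator only memorizes a summary of y during fit.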

There are two other strategies we can try. We can predict a supplied constant (refer to constant=None in the preceding output). We can also predict the median value.

Supplying a constant will only be considered if strategy is "constant".

Let's have a look:

# each pair is (strategy, constant); constant is only used by the "constant" strategy
predictors = [("mean", None), ("median", None), ("constant", 10)]
for strategy, constant in predictors:
    dumdum = dummy.DummyRegressor(strategy=strategy, constant=constant).fit(X, y)
    print("strategy: {}".format(strategy), ",".join(map(str, dumdum.predict(X)[:5])))
strategy: mean 10.399417325576314,10.399417325576314,10.399417325576314,10.399417325576314,10.399417325576314
strategy: median 7.262245710574845,7.262245710574845,7.262245710574845,7.262245710574845,7.262245710574845
strategy: constant 10,10,10,10,10
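To show how such a baseline might serve as a reference point, here is a hedged sketch that scores a mean-predicting dummy and an ordinary linear model on the same held-out data. The choice of LinearRegression, the train/test split, and all make_regression parameters are illustrative assumptions, not part of the original recipe.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative data; sizes, noise level and random_state are assumptions
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

# score() returns R^2 for regressors; a constant prediction cannot exceed 0
baseline_r2 = baseline.score(X_test, y_test)
model_r2 = model.score(X_test, y_test)
print("baseline R^2:", baseline_r2)
print("linear model R^2:", model_r2)
```

A real model is only interesting to the extent that its score clears the dummy's score on the same data.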

We actually have four options for classifiers. These strategies are similar to the continuous case; they are just slanted toward classification problems:


predictors = [("constant", 0),("stratified", None),("uniform", None),("most_frequent", None)]

We'll also need to create some classification data:

X, y = make_classification()
for strategy, constant in predictors:
    dumdum = dummy.DummyClassifier(strategy=strategy, constant=constant).fit(X, y)
    print("strategy: {}".format(strategy), ",".join(map(str, dumdum.predict(X)[:5])))
strategy: constant 0,0,0,0,0
strategy: stratified 0,0,1,1,1
strategy: uniform 0,1,0,1,0
strategy: most_frequent 0,0,0,0,0
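Five predictions are too few to see how the random strategies differ, so the sketch below measures the share of class-1 predictions for three of the strategies on a larger, imbalanced sample. The 80/20 class weights, the sample size, and the fixed random_state values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier

# Imbalanced two-class data; weights and sizes are illustrative assumptions
X, y = make_classification(n_samples=10000, weights=[0.8, 0.2], random_state=0)

positive_rate = {}
for strategy in ["stratified", "uniform", "most_frequent"]:
    clf = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    # Labels are 0/1, so the mean of the predictions is the share of class 1
    positive_rate[strategy] = clf.predict(X).mean()
    print(strategy, positive_rate[strategy])

print("training share of class 1:", y.mean())
```

With these settings, stratified should roughly reproduce the training class proportions, uniform should predict each class about half of the time, and most_frequent should never predict the minority class.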

How it works...

It's always good to test your models against the simplest models, and that's exactly what the dummy estimators give you. For example, imagine a fraud model in which only 5 percent of the data set is fraud. We can probably get a fairly accurate model just by never predicting fraud.


We can create this model by using the most_frequent strategy, as in the following command. It also gives a good example of why class imbalance causes problems:


from sklearn.metrics import accuracy_score
# 20,000 samples, roughly 95% in class 0 and 5% in class 1
X, y = make_classification(20000, weights=[.95, .05])
dumdum = dummy.DummyClassifier(strategy='most_frequent')
dumdum.fit(X, y)
DummyClassifier(constant=None, random_state=None, strategy='most_frequent')
print(accuracy_score(y, dumdum.predict(X)))


We were actually correct very often, but that's not the point. The point is that this is our baseline. If we cannot create a model for fraud that is more accurate than this, then it isn't worth our time.
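One way to see why class imbalance makes accuracy misleading is to look at recall on the minority class alongside accuracy. The following is a minimal sketch under that assumption; recall_score and the fixed random_state are additions that do not appear in the original recipe.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Roughly 95% class 0 ("not fraud") and 5% class 1 ("fraud"); random_state is an assumption
X, y = make_classification(n_samples=20000, weights=[0.95, 0.05], random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
# Recall on class 1: the share of actual fraud cases that the model flags
recall = recall_score(y, y_pred)
print("accuracy:", accuracy)
print("recall on class 1:", recall)
```

The baseline is right most of the time yet flags no fraud at all, so a useful fraud model has to beat it on more than accuracy alone.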
