Using Pipelines to combine multiple data preprocessing steps

2020-04-20 10:14:41


Pipelines are (at least to me) something I don't think about using often, but they are useful. They can be used to tie together many steps into one object. This allows for easier tuning and better access to the configuration of the entire model, not just one of the steps.


Getting ready

This is the first section where we'll combine multiple data processing steps into a single step. In scikit-learn, this is known as a Pipeline. In this section, we'll first deal with missing data via imputation; after that, we'll scale the data to get a mean of zero and a standard deviation of one. Let's create a dataset that is missing some values, and then we'll look at how to create a Pipeline:


from sklearn import datasets
import numpy as np
mat = datasets.make_spd_matrix(10)
masking_array = np.random.binomial(1, .1, mat.shape).astype(bool)
mat[masking_array] = np.nan
mat[:4, :4]
array([[ 0.56716186, -0.20344151,         nan, -0.22579163],
       [        nan,  1.98881836, -2.25445983,  1.27024191],
       [ 0.29327486, -2.25445983,  3.15525425, -1.64685403],
       [-0.22579163,  1.27024191, -1.64685403,  1.32240835]])

Great, now we can create a Pipeline.

How to do it...

Without Pipelines, the process will look something like the following:

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
impute = SimpleImputer()
scaler = preprocessing.StandardScaler()
mat_imputed = impute.fit_transform(mat)
mat_imputed[:4, :4]
array([[ 0.56716186, -0.20344151, -0.80554023, -0.22579163],
       [ 0.04235695,  1.98881836, -2.25445983,  1.27024191],
       [ 0.29327486, -2.25445983,  3.15525425, -1.64685403],
       [-0.22579163,  1.27024191, -1.64685403,  1.32240835]])
mat_imp_and_scaled = scaler.fit_transform(mat_imputed)
mat_imp_and_scaled[:4, :4]
array([[ 2.235e+00, -6.291e-01,  1.427e-16, -7.496e-01],
       [ 0.000e+00,  1.158e+00, -9.309e-01,  9.072e-01],
       [ 1.068e+00, -2.301e+00,  2.545e+00, -2.323e+00],
       [-1.142e+00,  5.721e-01, -5.405e-01,  9.650e-01]])

Notice that the previously missing value is now 0 (up to floating-point precision). This is expected because this value was imputed using the mean strategy, and scaling then subtracts the mean. Now that we've looked at a non-Pipeline example, let's look at how we can incorporate a Pipeline:

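To see why the imputed cell lands exactly at zero, here is a minimal sketch (using a small synthetic matrix rather than the book's data): the mean-imputed value equals the column mean, and adding the column mean back to a column does not change that mean, so centering maps the imputed cell to zero.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
mat = rng.normal(size=(10, 3))
mat[0, 2] = np.nan  # introduce one missing value

# Mean imputation fills the cell with the column mean of the observed values.
imputed = SimpleImputer(strategy="mean").fit_transform(mat)
# StandardScaler subtracts the (unchanged) column mean, so the cell becomes 0.
scaled = StandardScaler().fit_transform(imputed)
print(np.isclose(scaled[0, 2], 0.0))  # True
```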

from sklearn import pipeline
pipe = pipeline.Pipeline([('impute', impute), ('scaler', scaler)])

Take a look at the Pipeline. As we can see, Pipeline defines the steps that designate the progression of methods:


pipe
Pipeline(memory=None,
         steps=[('impute',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True))],
         verbose=False)

This is the best part; simply call the fit_transform method on the pipe object. These separate steps are completed in a single step:


new_mat = pipe.fit_transform(mat)
new_mat[:4, :4]
array([[ 2.235e+00, -6.291e-01,  1.427e-16, -7.496e-01],
       [ 0.000e+00,  1.158e+00, -9.309e-01,  9.072e-01],
       [ 1.068e+00, -2.301e+00,  2.545e+00, -2.323e+00],
       [-1.142e+00,  5.721e-01, -5.405e-01,  9.650e-01]])

We can also confirm that the two different methods give the same result:


np.array_equal(new_mat, mat_imp_and_scaled)
True

Beautiful!
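As a side note, a fitted Pipeline also keeps each step accessible by its label through named_steps, so you can inspect what each stage learned. A small sketch with synthetic data (the step labels here mirror the ones used above):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # sprinkle in some missing values

pipe = Pipeline([('impute', SimpleImputer()), ('scaler', StandardScaler())])
pipe.fit_transform(X)

# named_steps exposes each fitted estimator by its label.
print(pipe.named_steps['impute'].statistics_)  # per-column imputation values
print(pipe.named_steps['scaler'].mean_)        # per-column means used for centering
```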

Later in the book, we'll see just how powerful this concept is. It doesn't stop at preprocessing steps; it can easily extend to dimensionality reduction and fitting different learning methods as well. Dimensionality reduction is handled on its own in the recipe Reducing dimensionality with PCA.


How it works...

As mentioned earlier, almost every estimator in scikit-learn has a similar interface. The important methods that allow Pipelines to function are:

1. fit
2. transform
3. fit_transform (a convenience method)

To be specific, if a Pipeline has N objects, the first N-1 objects must implement both fit and transform, and the Nth object must implement at least fit. If this doesn't happen, an error will be thrown.

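A sketch of that contract in practice: the first steps are transformers, while the final step here is a LinearRegression that only needs fit (and predict). The data and estimator choice below are made up purely for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0],
              [4.0, 5.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# The first two steps must transform; the last step only needs fit.
model = Pipeline([('impute', SimpleImputer()),
                  ('scale', StandardScaler()),
                  ('regress', LinearRegression())])
model.fit(X, y)
print(model.predict(X).shape)  # (4,)
```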

Pipeline will work correctly if these conditions are met, but it is still possible that not every method will work properly. For example, pipe has a method, inverse_transform , which does exactly what the name entails. However, because the impute step doesn't have an inverse_transform method, this method call will fail:


pipe.inverse_transform(new_mat)
AttributeError: 'SimpleImputer' object has no attribute 'inverse_transform'

However, this is possible with the scaler object:

scaler.inverse_transform(new_mat)[:4, :4]
array([[ 0.567, -0.203, -0.806, -0.226],
       [ 0.042, 1.989, -2.254, 1.27 ],
       [ 0.293, -2.254, 3.155, -1.647],
       [-0.226, 1.27 , -1.647, 1.322]])

Once a proper Pipeline is set up, it functions almost exactly how you'd expect. It's a series of for loops that fit and transform at each intermediate step, feeding the output to the subsequent transformation.

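That internal loop can be sketched in a few lines. This is a simplification, not the real implementation (which also handles caching, parameter routing, and the final non-transformer step), but it produces the same result for a chain of transformers:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Simplified sketch of Pipeline.fit_transform for all-transformer steps.
def manual_fit_transform(steps, X):
    for name, transformer in steps:
        X = transformer.fit_transform(X)  # fit each step, feed output forward
    return X

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 3))
X[rng.random(X.shape) < 0.1] = np.nan

manual = manual_fit_transform(
    [('impute', SimpleImputer()), ('scaler', StandardScaler())], X)
piped = Pipeline(
    [('impute', SimpleImputer()), ('scaler', StandardScaler())]).fit_transform(X)
print(np.allclose(manual, piped))  # True
```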

To conclude this recipe, I'll try to answer the "why?" question. There are two main reasons:

1. The first reason is convenience. The code becomes quite a bit cleaner; instead of calling fit and transform over and over, it is offloaded to sklearn.

2. The second, and probably the more important, reason is cross validation. Models can become very complex. If a single step in Pipeline has tuning parameters, they might need to be tested; with a single step, the code overhead to test the parameters is low. However, five steps with all of their respective parameters can become difficult to test. Pipelines ease a lot of the burden.

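To make the cross-validation point concrete, here is a hedged sketch (synthetic data, and Ridge chosen arbitrarily as a final estimator): GridSearchCV can tune parameters of any step at once, addressing them with the step-name, double-underscore naming convention.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.normal(size=50)

pipe = Pipeline([('impute', SimpleImputer()),
                 ('scale', StandardScaler()),
                 ('ridge', Ridge())])

# Parameters are addressed as <step name>__<parameter name>.
grid = GridSearchCV(pipe,
                    {'impute__strategy': ['mean', 'median'],
                     'ridge__alpha': [0.1, 1.0, 10.0]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```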
