Imputing missing values through various strategies

2020-04-20 10:14:48

Data imputation is critical in practice, and thankfully there are many ways to handle missing data. In this recipe, we'll look at a few of the strategies. However, be aware that other approaches might fit your situation better.

scikit-learn can perform fairly common imputations on its own: it applies a simple transformation to the existing data and fills in the NAs. However, if there is a known reason why the data is missing (for example, response times for a server that times out after 100 ms), it might be better to take a statistical approach through other packages, such as a Bayesian treatment via PyMC, hazard models via Lifelines, or something home-grown.

Getting ready

The first step in learning how to impute missing values is to create some. NumPy's Boolean mask indexing makes this extremely simple:

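from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
iris_X = iris.data
# Boolean mask with the same shape as the data; each cell is True with probability 0.25
masking_array = np.random.binomial(1, .25, iris_X.shape).astype(bool)
# Boolean indexing: overwrite every cell where the mask is True
iris_X[masking_array] = np.nan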

To unravel this a bit, in case NumPy isn't too familiar: it's possible to index arrays with other arrays in NumPy. So, to create the random missing data, a random Boolean array of the same shape as the iris dataset is created. Then the assignment is made through that Boolean mask. Because the mask is random, your masking_array will most likely differ from the one used here. To check that it worked, inspect the first few rows (again, your output might not match exactly):

masking_array[:5]
array([[False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False, False],
       [False, False, False, False]], dtype=bool)
iris_X[:5]
array([[ 5.1, 3.5, 1.4, 0.2],
       [ 4.9, 3. , 1.4, 0.2],
       [ 4.7, 3.2, 1.3, 0.2],
       [ nan, 3.1, 1.5, 0.2],
       [ 5. , 3.6, 1.4, 0.2]])
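Because the mask above is random, the output changes from run to run. The following is a minimal sketch of a reproducible variant, assuming a seeded NumPy generator (the seed 0 and the names rng, X and mask are illustrative choices, not part of the original recipe). It also works on a copy, so the array held by the loaded dataset is left untouched:

```python
import numpy as np
from sklearn import datasets

rng = np.random.default_rng(0)  # seed 0 is an arbitrary, illustrative choice

# Work on a copy so the array stored in the loaded dataset stays intact.
X = datasets.load_iris().data.copy()

# True marks a cell to blank out; each cell is drawn independently with p = 0.25.
mask = rng.random(X.shape) < 0.25
X[mask] = np.nan

# The NaN positions coincide exactly with the mask.
print(np.array_equal(np.isnan(X), mask))
print(f"fraction missing: {np.isnan(X).mean():.3f}")
```

The printed fraction should land near 0.25, though not exactly on it, since each cell is masked independently.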

How to do it...

A theme prevalent throughout this book (because it runs throughout scikit-learn) is reusable classes that fit and transform datasets and that can subsequently be used to transform unseen datasets. This is illustrated as follows:

# Older scikit-learn releases exposed the imputer as preprocessing.Imputer:
# from sklearn import preprocessing
# impute = preprocessing.Imputer()

from sklearn.impute import SimpleImputer  # current releases
impute = SimpleImputer()
iris_X_prime = impute.fit_transform(iris_X)
iris_X_prime[:5]
array([[ 5.1 , 3.5 , 1.4 , 0.2 ],
       [ 4.9 , 3. , 1.4 , 0.2 ],
       [ 4.7 , 3.2 , 1.3 , 0.2 ],
       [ 5.87923077, 3.1 , 1.5 , 0.2 ],
       [ 5. , 3.6 , 1.4 , 0.2 ]])

Notice the difference at position [3, 0]:

iris_X_prime[3, 0]
5.87923077
iris_X[3, 0]
nan
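The reuse on unseen data mentioned earlier can be sketched as follows. This is a small illustration on made-up arrays (X_train and X_new are hypothetical names, not part of the recipe): the imputer learns its fill values from one array and applies them to another.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [np.nan, 30.0]])
X_new = np.array([[np.nan, 5.0],
                  [7.0, np.nan]])

imputer = SimpleImputer()   # default strategy is the column mean
imputer.fit(X_train)        # learns one mean per column from X_train only

# Missing cells in X_new are filled with the means learned from X_train.
X_new_filled = imputer.transform(X_new)
print(imputer.statistics_)  # [ 2. 20.]
print(X_new_filled)
```

The gaps in X_new are filled with 2.0 and 20.0, the column means of X_train, rather than with statistics computed from X_new itself.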

How it works...

The imputation works by employing different strategies. The default is mean, and the statistical strategies are:

1. mean (default)
2. median
3. most_frequent (the mode)

SimpleImputer additionally accepts constant, which fills with a fixed value instead of a statistic.

scikit-learn uses the selected strategy to compute one fill value per column from that column's non-missing values. It then simply fills the missing values with it. For example, to redo the iris example with the median strategy, reinitialize impute with the new strategy:

# SimpleImputer comes from sklearn.impute; the old preprocessing.Imputer has been removed
impute = SimpleImputer(strategy='median')
iris_X_prime = impute.fit_transform(iris_X)
iris_X_prime[:5]
array([[ 5.1, 3.5, 1.4, 0.2],
       [ 4.9, 3. , 1.4, 0.2],
       [ 4.7, 3.2, 1.3, 0.2],
       [ 5.8, 3.1, 1.5, 0.2],
       [ 5. , 3.6, 1.4, 0.2]])
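To make the per-column behaviour concrete, here is a small sketch on a made-up array (X is illustrative data, not the iris set). It compares the fill values the imputer learned, exposed as statistics_, with NumPy's NaN-aware column medians:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan],
              [9.0, 6.0]])

imputer = SimpleImputer(strategy="median").fit(X)

# One fill value per column, computed from that column's non-missing entries.
col_medians = np.nanmedian(X, axis=0)
print(imputer.statistics_)  # [5. 4.]
print(col_medians)          # [5. 4.]

X_filled = imputer.transform(X)
print(X_filled)
```

Each NaN is replaced by the median of its own column, so the first column receives 5.0 and the second 4.0.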

If the data is missing values, it might be inherently dirty in other places. For instance, in the preceding How to do it... section, np.nan (the default missing value) was used to mark missing values, but missing values can be represented in many ways. Consider a situation where missing values are encoded as -1. In addition to the strategy used to compute the fill value, the imputer also lets you specify which value counts as missing, through its missing_values parameter. The default is np.nan. To see an example of this, modify iris_X to have -1 as the missing value. It sounds crazy, but since the iris dataset contains measurements that are always positive, many people will fill the missing values with -1 to signify they're not there:

iris_X[np.isnan(iris_X)] = -1
iris_X[:5]
array([[ 5.1, 3.5, 1.4, 0.2],
       [ 4.9, 3. , 1.4, 0.2],
       [ 4.7, 3.2, 1.3, 0.2],
       [-1. , 3.1, 1.5, 0.2],
       [ 5. , 3.6, 1.4, 0.2]])
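With -1 as the marker, the imputer has to be told what to look for. A minimal sketch, assuming a small made-up array rather than the full iris data, and using the missing_values parameter of SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# -1 stands in for "not measured"; all real measurements are positive.
X = np.array([[5.1, 3.5],
              [-1.0, 3.0],
              [4.7, -1.0]])

imputer = SimpleImputer(missing_values=-1, strategy="mean")
X_filled = imputer.fit_transform(X)

# The -1 cells are excluded from the column means and then replaced by them.
print(imputer.statistics_)
print(X_filled)
```

The sentinel cells are left out when the column means are computed, so the fill values come out near 4.9 and 3.25 instead of being dragged down by the -1 entries.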

There's more...

pandas also provides functionality to fill missing data. It might actually be a bit more flexible, but it is less reusable:

import pandas as pd
iris_X[masking_array] = np.nan
iris_df = pd.DataFrame(iris_X, columns=iris.feature_names)
iris_df.fillna(iris_df.mean())['sepal length (cm)'].head(5)
0 5.100000
1 4.900000
2 4.700000
3 5.879231
4 5.000000
Name: sepal length (cm), dtype: float64

To illustrate this flexibility, fillna can be passed any sort of statistic, so the strategy can be defined more arbitrarily:

iris_df.fillna(iris_df.max())['sepal length (cm)'].head(5)
0 5.1
1 4.9
2 4.7
3 7.9
4 5.0
Name: sepal length (cm), dtype: float64
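As a further sketch of that flexibility, on a made-up frame (the columns a and b are hypothetical), fillna also accepts a dict, so each column can follow its own rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 8.0],
                   "b": [np.nan, 2.0, 4.0, 6.0]})

# A Series indexed by column name fills each column with its own statistic.
by_median = df.fillna(df.median())

# A dict lets every column use a different rule.
mixed = df.fillna({"a": df["a"].min(), "b": df["b"].quantile(0.75)})

print(by_median)
print(mixed)
```

The trade-off noted above still applies: unlike a fitted imputer, nothing here remembers the fill values, so applying the same rule to new data means recomputing or storing them yourself.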
