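Authors: Yingxiang Chen & Zihan Yang
Editor: 红色石头
Feature engineering matters a great deal in machine learning: done well, it can significantly improve model performance. We have compiled a systematic feature engineering tutorial on GitHub for reference and study.
Project repository: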
https://github.com/YC-Coder-Chen/feature-engineering-handbook
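This article covers the data preprocessing part: how to use scikit-learn to handle static continuous variables, Category Encoders to handle static categorical variables, and Featuretools to handle common time series variables.
Contents
We split data preprocessing for feature engineering into three parts:
- Static continuous variables
- Static categorical variables
- Time series variables
This article covers 1.1, the preprocessing of static continuous variables, and walks through it in Jupyter using sklearn.
1.1 Static Continuous Variables
1.1.1 Discretization
Discretizing continuous variables can make a model more robust. For example, when predicting customer purchase behavior, a customer with 30 past purchases probably behaves very much like a customer with 32. Excess precision in a feature can be noise, which is why LightGBM uses a histogram-based algorithm to help prevent overfitting. There are two ways to discretize a continuous variable.
1.1.1.1 Binarization
Binarize a numerical feature.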
# load the sample data
from sklearn.datasets import fetch_california_housing
dataset = fetch_california_housing()
X, y = dataset.data, dataset.target # we will take the first column as the example later
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
sns.distplot(X[:,0], hist = True, kde=True)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution
from sklearn.preprocessing import Binarizer
sample_columns = X[0:10,0] # select the top 10 samples
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
model = Binarizer(threshold=6) # set 6 to be the threshold
# if value <= 6, then return 0 else return 1
result = model.fit_transform(sample_columns.reshape(-1,1)).reshape(-1)
# return array([1., 1., 1., 0., 0., 0., 0., 0., 0., 0.])
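Thresholding is simple enough to reproduce by hand, which makes the boundary behaviour explicit: a value equal to the threshold maps to 0. The following is a minimal NumPy sketch using the ten sample values listed above; the helper name `binarize` is ours and is not part of scikit-learn.

```python
import numpy as np

def binarize(values, threshold):
    """Return 1.0 where value > threshold, else 0.0 (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return (values > threshold).astype(float)

sample = np.array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462,
                   4.0368, 3.6591, 3.12, 2.0804, 3.6912])
result = binarize(sample, threshold=6)
boundary = binarize([6.0], threshold=6)  # a value exactly at the threshold
```

Only the first three samples exceed 6, so only they are mapped to 1.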
1.1.1.2 Binning
Bin a numerical feature.
Uniform binning:
from sklearn.preprocessing import KBinsDiscretizer
# in order to mimic the operation in real-world, we shall fit the KBinsDiscretizer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]
model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform') # set 5 bins
# return ordinal bin number, set all bins to have identical widths
model.fit(train_set.reshape(-1,1))
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([2., 2., 2., 1., 1., 1., 1., 0., 0., 1.])
bin_edge = model.bin_edges_[0]
# return array([ 0.4999 , 3.39994, 6.29998, 9.20002, 12.10006, 15.0001 ]), the bin edges
# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist = True, kde=True)
for edge in bin_edge: # uniform bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Uniform Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);
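To make the uniform strategy concrete, the sketch below rebuilds equal-width edges with plain NumPy and assigns bin indices from the interior edges. The helper names are hypothetical, and the tiny `train` array is a stand-in chosen only so that its minimum and maximum match the bin edges shown above; it is not the real training column.

```python
import numpy as np

def uniform_bin_edges(train_values, n_bins):
    """Equal-width edges spanning the training range (hypothetical helper)."""
    return np.linspace(np.min(train_values), np.max(train_values), n_bins + 1)

def assign_bins(values, edges):
    """Ordinal bin index; values beyond the outer edges land in the end bins."""
    # only the interior edges are needed to decide the bin index
    return np.digitize(values, edges[1:-1])

# stand-in training data: only its min and max matter for uniform edges
train = np.array([0.4999, 2.0, 3.5, 5.0, 15.0001])
test = np.array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462,
                 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

edges = uniform_bin_edges(train, n_bins=5)
bins = assign_bins(test, edges)
```

Because the edges depend only on the minimum and maximum, a single extreme value stretches every bin, which is why long-tailed features often leave the upper bins almost empty.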
Quantile binning:
from sklearn.preprocessing import KBinsDiscretizer
# in order to mimic the operation in real-world, we shall fit the KBinsDiscretizer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]
model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile') # set 5 bins
# return ordinal bin number, set all bins based on quantile
model.fit(train_set.reshape(-1,1))
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([4., 4., 4., 4., 2., 3., 2., 1., 0., 2.])
bin_edge = model.bin_edges_[0]
# return array([ 0.4999 , 2.3523 , 3.1406 , 3.9667 , 5.10824, 15.0001 ]), the bin edges
# 2.3523 is the 20% quantile
# 3.1406 is the 40% quantile, etc..
# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist = True, kde=True)
for edge in bin_edge: # quantile based bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Quantiles Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);
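Quantile binning trades equal widths for equal populations. The sketch below illustrates this on synthetic long-tailed data (an assumption made for the example, not the housing column): the edges are taken at evenly spaced quantiles of the training values, so each bin receives the same number of training samples while the widths vary.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic long-tailed feature, used only for illustration
train = rng.lognormal(mean=1.0, sigma=0.5, size=1000)

n_bins = 5
# edges at the 0%, 20%, 40%, 60%, 80% and 100% quantiles of the training data
edges = np.quantile(train, np.linspace(0, 1, n_bins + 1))
bins = np.digitize(train, edges[1:-1])

counts = np.bincount(bins, minlength=n_bins)  # training samples per bin
widths = np.diff(edges)                       # bin widths are far from equal
```

On a long-tailed feature the last bin is much wider than the ones in the middle, which is the opposite of what uniform binning produces.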
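1.1.2 Scaling
Features on different scales are hard to compare, especially in linear models such as linear regression and logistic regression. In k-means clustering or KNN models based on Euclidean distance, feature scaling is required; otherwise the distance measure is meaningless. For any algorithm that uses gradient descent, scaling also speeds up convergence.
Some commonly used models:
Note: skewness affects PCA, so it is best to remove skewness with a power transformation.
1.1.2.1 Standard Scaling (Z-score Standardization)
Formula:
$$X' = \frac{X - \mu}{\sigma}$$
where X is the variable (feature), μ is the mean of X, and σ is the standard deviation of X. This method is very sensitive to outliers, because outliers affect both μ and σ.
from sklearn.preprocessing import StandardScaler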
# in order to mimic the operation in real-world, we shall fit the StandardScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]
model = StandardScaler()
model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 2.34539745, 2.33286782, 1.78324852, 0.93339178, -0.0125957 ,
# 0.08774668, -0.11109548, -0.39490751, -0.94221041, -0.09419626])
# result is the same as ((X[0:10,0] - X[10:,0].mean())/X[10:,0].std())
# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution
model = StandardScaler()
model.fit(X[:,0].reshape(-1,1))
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)
# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout()
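The outlier sensitivity of standard scaling is easy to demonstrate. The sketch below uses synthetic data (an assumption for illustration): it computes the mean and standard deviation on a training sample, reuses them for the test values, and then repeats the exercise after appending a single extreme value to the training sample.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=5.0, size=1000)  # synthetic feature
test = np.array([40.0, 50.0, 60.0])

# statistics come from the train set only (population std, ddof=0)
mu, sigma = train.mean(), train.std()
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuse the train statistics

# a single extreme value shifts both statistics
train_outlier = np.append(train, 5000.0)
mu_out, sigma_out = train_outlier.mean(), train_outlier.std()
test_scaled_out = (test - mu_out) / sigma_out
```

With the outlier present the standard deviation grows many times over, so the three test values are squeezed into a much narrower interval than before.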
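1.1.2.2 MinMaxScaler (Scaling to a Range)
Suppose the range we want to scale the feature to is (a, b).
Formula:
$$X' = a + \frac{(X - \mathrm{Min})(b - a)}{\mathrm{Max} - \mathrm{Min}}$$
where Min is the minimum of X and Max is the maximum of X. This method is also very sensitive to outliers, because outliers affect both Min and Max.
from sklearn.preprocessing import MinMaxScaler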
# in order to mimic the operation in real-world, we shall fit the MinMaxScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]
model = MinMaxScaler(feature_range=(0,1)) # set the range to be (0,1)
model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([0.53966842, 0.53802706, 0.46602805, 0.35469856, 0.23077613,
# 0.24392077, 0.21787286, 0.18069406, 0.1089985 , 0.22008662])
# result is the same as (X[0:10,0] - X[10:,0].min())/(X[10:,0].max()-X[10:,0].min())
# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution
model = MinMaxScaler(feature_range=(0,1))
model.fit(X[:,0].reshape(-1,1))
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)
# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout() # now the scale change to [0,1]
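The examples above use the range (0, 1). As a small sketch of the general case, the hypothetical helper below applies the min-max formula for an arbitrary target range (a, b), with the minimum and maximum taken from the training values. The toy numbers are made up for illustration.

```python
import numpy as np

def min_max_scale(values, train_min, train_max, feature_range=(0.0, 1.0)):
    """Map [train_min, train_max] linearly onto feature_range (hypothetical helper)."""
    a, b = feature_range
    values = np.asarray(values, dtype=float)
    return a + (values - train_min) * (b - a) / (train_max - train_min)

train = np.array([2.0, 4.0, 6.0, 10.0])
test = np.array([1.0, 6.0, 12.0])  # 1.0 and 12.0 lie outside the training range

lo, hi = train.min(), train.max()
train_scaled = min_max_scale(train, lo, hi, feature_range=(-1.0, 1.0))
test_scaled = min_max_scale(test, lo, hi, feature_range=(-1.0, 1.0))
```

Note that test values outside the training range are mapped outside (a, b): here 1.0 becomes -1.25 and 12.0 becomes 1.5.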
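1.1.2.3 RobustScaler (Outlier-Resistant Scaling)
Scale the feature using statistics that are robust to outliers (quantiles). Suppose the quantile range used for scaling is (a, b).
Formula:
$$X' = \frac{X - \mathrm{median}(X)}{Q_b(X) - Q_a(X)}$$
where Q_a(X) and Q_b(X) are the a-th and b-th percentiles of X. This method is more robust to outliers.
import numpy as np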
from sklearn.preprocessing import RobustScaler
# in order to mimic the operation in real-world, we shall fit the RobustScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]
model = RobustScaler(with_centering = True, with_scaling = True,
quantile_range = (25.0, 75.0))
# with_centering = True => recenter the feature by set X' = X - X.median()
# with_scaling = True => rescale the feature by the quantile set by user
# set the quantile to the (25%, 75%)
model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 2.19755974, 2.18664281, 1.7077657 , 0.96729508, 0.14306683,
# 0.23049401, 0.05724508, -0.19003715, -0.66689601, 0.07196918])
# result is the same as (X[0:10,0] - np.quantile(X[10:,0], 0.5))/(np.quantile(X[10:,0],0.75)-np.quantile(X[10:,0], 0.25))
# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution
model = RobustScaler(with_centering = True, with_scaling = True,
quantile_range = (25.0, 75.0))
model.fit(X[:,0].reshape(-1,1))
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)
# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout()
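To see why median and quantile statistics are more robust, the sketch below replaces the largest value of a toy sample with an extreme outlier and compares how far the median and the mean move. The helper `robust_scale` is a hypothetical re-implementation of the formula, not scikit-learn code.

```python
import numpy as np

def robust_scale(values, train_values, quantile_range=(25.0, 75.0)):
    """Center by the train median, scale by the train quantile range (hypothetical helper)."""
    q_low, q_high = np.percentile(train_values, quantile_range)
    centered = np.asarray(values, dtype=float) - np.median(train_values)
    return centered / (q_high - q_low)

train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
with_outlier = np.append(train[:-1], 900.0)  # largest value replaced by an outlier

median_shift = abs(np.median(with_outlier) - np.median(train))
mean_shift = abs(with_outlier.mean() - train.mean())

scaled = robust_scale(train, train)
scaled_with_outlier = robust_scale(train[:-1], with_outlier)
```

The median and the interquartile range do not move at all in this example, so the eight regular points are scaled identically with or without the outlier, while the mean shifts by 99.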
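1.1.2.4 Power Transformation (Non-linear Transformation)
All the scaling methods above preserve the shape of the original distribution. But normality is an important assumption in many statistical models. We can use a power transformation to make the original distribution approximately normal.
Box-Cox transformation:
The Box-Cox transformation only works for strictly positive values and takes the following form:
$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln x, & \lambda = 0 \end{cases}$$
All values of λ are considered, and the optimal value, which stabilizes variance and minimizes skewness, is selected by maximum likelihood estimation.
from sklearn.preprocessing import PowerTransformer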
# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]
model = PowerTransformer(method='box-cox', standardize=True)
# apply box-cox transformation
model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 1.91669292, 1.91009687, 1.60235867, 1.0363095 , 0.19831579,
# 0.30244247, 0.09143411, -0.24694006, -1.08558469, 0.11011933])
# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution
model = PowerTransformer(method='box-cox', standardize=True)
model.fit(X[:,0].reshape(-1,1))
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)
# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout()
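For intuition about what happens for a given λ, here is a minimal sketch of the standard Box-Cox formula with λ supplied by hand. It does not estimate λ by maximum likelihood and does not standardize the output, both of which PowerTransformer does in the code above; the function name `box_cox` is ours.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for a fixed lambda (illustrative, no lambda estimation)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

x = np.array([0.5, 1.0, 2.0, 8.0])
log_case = box_cox(x, 0.0)        # lambda = 0 is the log transform
near_zero = box_cox(x, 1e-6)      # approaches the log transform as lambda -> 0
shift_case = box_cox(x, 1.0)      # lambda = 1 only shifts the data by -1
```

The λ = 0 branch is the limit of the general expression, which is why the two cases join smoothly.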
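Yeo-Johnson transformation:
The Yeo-Johnson transformation works for both positive and negative values and takes the following form:
$$x^{(\lambda)} = \begin{cases} \dfrac{(x + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ x \geq 0 \\ \ln(x + 1), & \lambda = 0,\ x \geq 0 \\ -\dfrac{(-x + 1)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ x < 0 \\ -\ln(-x + 1), & \lambda = 2,\ x < 0 \end{cases}$$
All values of λ are considered, and the optimal value, which stabilizes variance and minimizes skewness, is selected by maximum likelihood estimation.
from sklearn.preprocessing import PowerTransformer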
# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]
model = PowerTransformer(method='yeo-johnson', standardize=True)
# apply yeo-johnson transformation
model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 1.90367888, 1.89747091, 1.604735 , 1.05166306, 0.20617221,
# 0.31245176, 0.09685566, -0.25011726, -1.10512438, 0.11598074])
# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution
model = PowerTransformer(method='yeo-johnson', standardize=True)
model.fit(X[:,0].reshape(-1,1))
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)
# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout()
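The piecewise definition of Yeo-Johnson can be checked with a short sketch for a fixed λ. As with the Box-Cox case, this is an illustration of the standard formula only: it does not estimate λ or standardize the result, and the function name `yeo_johnson` is ours.

```python
import numpy as np

def yeo_johnson(x, lam):
    """Yeo-Johnson transform for a fixed lambda (illustrative, no lambda estimation)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # non-negative inputs: Box-Cox applied to x + 1
    if lam != 0:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(x[pos])
    # negative inputs: mirrored Box-Cox with exponent 2 - lambda
    if lam != 2:
        out[~pos] = -((-x[~pos] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
identity = yeo_johnson(x, 1.0)   # lambda = 1 leaves the data unchanged
log_like = yeo_johnson(x, 0.0)   # log1p on the non-negative side
curved = yeo_johnson(x, 0.5)     # still strictly increasing
```

For λ = 1 both branches reduce to the identity, which is a convenient sanity check for any implementation.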
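1.1.3 Normalization
All the scaling methods above operate column by column. Normalization instead works on each row: it rescales each sample so that it has unit norm. Because it works row by row, it distorts the relationships between features and is therefore less common. It is, however, very useful in text classification and clustering contexts.
Suppose X[i][j] denotes the value of feature j in sample i.
L1 normalization formula:
$$X'[i][j] = \frac{X[i][j]}{\sum_{k} |X[i][k]|}$$
L2 normalization formula:
$$X'[i][j] = \frac{X[i][j]}{\sqrt{\sum_{k} X[i][k]^2}}$$
L1 normalization:
from sklearn.preprocessing import Normalizer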
# Normalizer performs operation on each row independently
# So train set and test set are processed independently
###### for L1 Norm
sample_columns = X[0:2,0:3] # select the first two samples, and the first three features
# return array([[ 8.3252, 41., 6.98412698],
# [ 8.3014 , 21., 6.23813708]])
model = Normalizer(norm='l1')
# use L1 Norm to normalize each sample
model.fit(sample_columns)
result = model.transform(sample_columns) # test set are processed similarly
# return array([[0.14784762, 0.72812094, 0.12403144],
# [0.23358211, 0.59089121, 0.17552668]])
# result = sample_columns/np.sum(np.abs(sample_columns), axis=1).reshape(-1,1)
L2 normalization:
###### for L2 Norm
sample_columns = X[0:2,0:3] # select the first three features
# return array([[ 8.3252, 41., 6.98412698],
# [ 8.3014 , 21., 6.23813708]])
model = Normalizer(norm='l2')
# use L2 Norm to normalize each sample
model.fit(sample_columns)
result = model.transform(sample_columns)
# return array([[0.19627663, 0.96662445, 0.16465922],
# [0.35435076, 0.89639892, 0.26627902]])
# result = sample_columns/np.sqrt(np.sum(sample_columns**2, axis=1)).reshape(-1,1)
# visualize the difference in the distribution after Normalization
# compare it with the distribution after RobustScaling
# fit and transform the entire first & second feature
import seaborn as sns
import matplotlib.pyplot as plt
# RobustScaler
fig, ax = plt.subplots(2,1, figsize = (13,9))
model = RobustScaler(with_centering = True, with_scaling = True,
quantile_range = (25.0, 75.0))
model.fit(X[:,0:2])
result = model.transform(X[:,0:2])
sns.scatterplot(x=result[:,0], y=result[:,1], ax=ax[0])
ax[0].set_title('Scatter Plot of RobustScaling result', fontsize=12)
ax[0].set_xlabel('Feature 1', fontsize=12)
ax[0].set_ylabel('Feature 2', fontsize=12);
model = Normalizer(norm='l2')
model.fit(X[:,0:2])
result = model.transform(X[:,0:2])
sns.scatterplot(x=result[:,0], y=result[:,1], ax=ax[1])
ax[1].set_title('Scatter Plot of Normalization result', fontsize=12)
ax[1].set_xlabel('Feature 1', fontsize=12)
ax[1].set_ylabel('Feature 2', fontsize=12);
fig.tight_layout() # Normalization distorts the original distribution
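Row-wise normalization can be written directly in NumPy, which makes it clear that each sample is divided by its own norm and that no statistics are shared between rows. The sketch below uses the two sample rows printed in the examples above.

```python
import numpy as np

sample = np.array([[8.3252, 41.0, 6.98412698],
                   [8.3014, 21.0, 6.23813708]])

# L1: divide each row by the sum of absolute values in that row
l1 = sample / np.abs(sample).sum(axis=1, keepdims=True)
# L2: divide each row by its Euclidean length
l2 = sample / np.sqrt((sample ** 2).sum(axis=1, keepdims=True))

l1_norms = np.abs(l1).sum(axis=1)          # every row now has L1 norm 1
l2_norms = np.sqrt((l2 ** 2).sum(axis=1))  # every row now has L2 norm 1
```

Because nothing is learned from the data, a train set and a test set can be normalized independently.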
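1.1.4 Imputation of Missing Values
In practice, a dataset may contain missing values. Such datasets are incompatible with most scikit-learn models, which assume that all features are numerical and contain no missing values. So before applying scikit-learn models, we need to impute the missing values.
However, some newer models implemented in other packages, such as XGBoost, LightGBM and CatBoost, support missing values in the dataset. When using these models, we no longer need to fill in the missing values.
1.1.4.1 Univariate Feature Imputation
Suppose column i has missing values; we impute them with a constant or with a statistic of column i (mean, median or mode).
from sklearn.impute import SimpleImputer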
test_set = X[0:10,0].copy() # no missing values
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
# manually create some missing values
test_set[3] = np.nan
test_set[6] = np.nan
# now test_set becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12 ,2.0804, 3.6912])
# create the train samples
# in real-world, we should fit the imputer on train set and transform the test set.
train_set = X[10:,0].copy()
train_set[3] = np.nan
train_set[6] = np.nan
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # use mean
# we can set the strategy to 'mean', 'median', 'most_frequent', 'constant'
imputer.fit(train_set.reshape(-1,1))
result = imputer.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([8.3252 , 8.3014 , 7.2574 , 3.87023658, 3.8462 ,
# 4.0368 , 3.87023658, 3.12 , 2.0804 , 3.6912 ])
# all missing values are imputed with 3.87023658
# 3.87023658 = np.nanmean(train_set)
# which is the mean of the trainset ignoring missing values
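The same idea in plain NumPy, as a sketch on made-up numbers: the fill value is a statistic of the observed training values, and that single value is then written into the missing positions of the test data.

```python
import numpy as np

train = np.array([1.0, np.nan, 3.0, 5.0, np.nan, 11.0])
test = np.array([np.nan, 2.0, np.nan])

fill_mean = np.nanmean(train)      # mean of the observed train values only
fill_median = np.nanmedian(train)  # median of the observed train values only

test_mean_imputed = np.where(np.isnan(test), fill_mean, test)
test_median_imputed = np.where(np.isnan(test), fill_median, test)
```

With a skewed column such as this one the mean (5.0) and the median (4.0) differ, so the choice of strategy changes the imputed values.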
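1.1.4.2 Multivariate Feature Imputation
Multivariate feature imputation uses information from the whole dataset to estimate and impute missing values. In scikit-learn it is implemented in an iterated round-robin fashion.
At each step, one feature column is designated as the output y and the other feature columns are treated as the inputs X. A regressor is fit on (X, y) for the rows where y is known and is then used to predict the missing values of y. This is done for each feature in turn, and the whole cycle is repeated for up to max_iter imputation rounds.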
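The round-robin procedure can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not scikit-learn's implementation: it uses ordinary least squares as the regressor, starts from column means, runs a fixed number of rounds with no convergence check, and the function name `iterative_impute` and the synthetic data are our own assumptions.

```python
import numpy as np

def iterative_impute(X, max_iter=10):
    """Simplified round-robin imputation with ordinary least squares."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # initial guess: column means over the observed entries
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            # column j is the target, all other columns (plus intercept) are inputs
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            observed = ~missing[:, j]
            coef, *_ = np.linalg.lstsq(A[observed], filled[observed, j], rcond=None)
            # overwrite only the originally missing entries with predictions
            filled[missing[:, j], j] = A[missing[:, j]] @ coef
    return filled

rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
X_full = np.column_stack([x0, 2.0 * x0 + 1.0])  # second column is linear in the first
X_missing = X_full.copy()
X_missing[5, 1] = np.nan
X_missing[17, 0] = np.nan

X_imputed = iterative_impute(X_missing, max_iter=10)
```

Because the two synthetic columns are exactly linearly related, the imputed entries end up very close to the true values; on real data the quality depends on how well the chosen regressor captures the relationship between features.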
Using a linear model (BayesianRidge as an example):
from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12 ,2.0804, 3.6912])
# create the train samples
# in real-world, we should fit the imputer on train set and transform the test set.
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan
impute_estimator = BayesianRidge()
imputer = IterativeImputer(max_iter = 10,
random_state = 0,
estimator = impute_estimator)
imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252 , 8.3014 , 7.2574 , 4.6237195 , 3.8462 ,
# 4.0368 , 4.00258149, 3.12 , 2.0804 , 3.6912 ])
Using a tree-based model (ExtraTrees as an example):
from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12 ,2.0804, 3.6912])
# create the train samples
# in real-world, we should fit the imputer on train set and transform the test set.
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan
impute_estimator = ExtraTreesRegressor(n_estimators=10, random_state=0)
# parameters can be tuned in CV through sklearn pipeline
imputer = IterativeImputer(max_iter = 10,
random_state = 0,
estimator = impute_estimator)
imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252 , 8.3014 , 7.2574 , 4.63813, 3.8462 , 4.0368 , 3.24721,
# 3.12 , 2.0804 , 3.6912 ])
Using K nearest neighbors (KNN):
from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor
test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12 ,2.0804, 3.6912])
# create the train samples
# in real-world, we should fit the imputer on train set and transform the test set.
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan
impute_estimator = KNeighborsRegressor(n_neighbors=10,
p = 1) # set p=1 to use manhattan distance
# use manhattan distance to reduce effect from outliers
# parameters can be tuned in CV through sklearn pipeline
imputer = IterativeImputer(max_iter = 10,
random_state = 0,
estimator = impute_estimator)
imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 3.6978, 3.8462, 4.0368, 4.052 , 3.12 ,
# 2.0804, 3.6912])
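1.1.4.3 Marking Imputed Values
Sometimes the fact that a value is missing is itself informative. scikit-learn therefore also provides a way to convert a dataset with missing values into a corresponding binary matrix that indicates where values are missing.
from sklearn.impute import MissingIndicator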
# illustrate this function on trainset only
# since the process is independent in train set and test set
train_set = X[10:,:].copy() # select all features
train_set[3,0] = np.nan # manually create some missing values
train_set[6,0] = np.nan
train_set[3,1] = np.nan
indicator = MissingIndicator(missing_values=np.nan, features='all')
# show the results on all the features
result = indicator.fit_transform(train_set) # result has the same shape as train_set
# contains only True & False, True corresponds with missing value
result[:,0].sum() # should return 2, the first column has two missing values
result[:,1].sum(); # should return 1, the second column has one missing value
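One common way to use such an indicator matrix is to append it to the imputed features, so that a downstream model sees both the filled-in value and the fact that it was originally missing. The NumPy sketch below shows the idea on a made-up 4 x 2 array; mean imputation is chosen here only for illustration.

```python
import numpy as np

X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan],
                    [np.nan, 40.0]])

mask = np.isnan(X_train)  # True where a value is missing
filled = np.where(mask, np.nanmean(X_train, axis=0), X_train)

# keep both the imputed values and the "was missing" flags as model inputs
augmented = np.hstack([filled, mask.astype(float)])
```

The second row, for example, becomes `[2.0, 20.0, 1.0, 0.0]`: the first feature was imputed with the column mean and is flagged as missing, while the second was observed.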
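1.1.5 Feature Transformation
1.1.5.1 Polynomial Transformation
Sometimes we want to introduce non-linear features into a model to increase its capacity. For simple linear models this substantially increases model complexity. More complex models, such as tree-based ML models, already capture non-linear relationships in their non-parametric tree structure, so this kind of feature transformation may not help them much.
For example, with the degree set to 3 and two input features, the transformation takes the form:
$$(X_1, X_2) \rightarrow (1,\ X_1,\ X_2,\ X_1^2,\ X_1 X_2,\ X_2^2,\ X_1^3,\ X_1^2 X_2,\ X_1 X_2^2,\ X_2^3)$$
from sklearn.preprocessing import PolynomialFeatures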
# illustrate this function on one synthesized sample
train_set = np.array([2,3]).reshape(1,-1) # shape (1,2)
# return array([[2, 3]])
poly = PolynomialFeatures(degree = 3, interaction_only = False)
# the highest degree is set to 3, and we want more than just interaction terms
result = poly.fit_transform(train_set) # have shape (1, 10)
# array([[ 1., 2., 3., 4., 6., 9., 8., 12., 18., 27.]])
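The ordering of the ten output columns follows from enumerating monomials degree by degree. The sketch below reproduces the expansion with the standard library only; `polynomial_terms` is a hypothetical helper written for this illustration and makes no attempt to be efficient.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_terms(row, degree):
    """All monomials of the entries of `row` up to `degree`, lowest degree first."""
    terms = []
    for d in range(degree + 1):
        # each combination picks d feature indices, repeats allowed
        for combo in combinations_with_replacement(range(len(row)), d):
            terms.append(prod(row[i] for i in combo))  # empty product is the bias term 1
    return terms

expanded = polynomial_terms([2, 3], degree=3)
n_terms_three_features = len(polynomial_terms([1, 1, 1], degree=2))
```

The number of generated columns grows quickly with both the degree and the number of input features, which is worth keeping in mind before expanding a wide dataset.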
1.1.5.2 Custom Transformation
from sklearn.preprocessing import FunctionTransformer
# illustrate this function on one synthesized sample
train_set = np.array([2,3]).reshape(1,-1) # shape (1,2)
# return array([[2, 3]])
transformer = FunctionTransformer(func = np.log1p, validate=True)
# perform log transformation, X' = log(1 + x)
# func can be any numpy function such as np.exp
result = transformer.fit_transform(train_set)
# return array([[1.09861229, 1.38629436]]), the same as np.log1p(train_set)
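A short NumPy sketch of why `np.log1p` is a reasonable choice for this kind of custom transform: it stays accurate for values very close to zero, and `np.expm1` undoes it exactly. The sample values are made up for illustration.

```python
import numpy as np

x = np.array([0.0, 2.0, 3.0, 1e-20])

forward = np.log1p(x)         # log(1 + x), accurate even for tiny x
naive = np.log(1.0 + x)       # 1.0 + 1e-20 rounds to 1.0, so the tiny value is lost
restored = np.expm1(forward)  # inverse of log1p
```

FunctionTransformer also accepts an `inverse_func` argument, so a pair such as `np.log1p` and `np.expm1` can be passed together when predictions need to be mapped back to the original scale.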
That wraps up the introduction to preprocessing static continuous variables. We recommend working through the code yourself in Jupyter.