Scaling data to the standard normal

2020-04-20 10:14:17

A preprocessing step that is almost always recommended is to scale columns to the standard normal. The standard normal is probably the most important distribution in all of statistics.


If you've ever been introduced to statistics, you have almost certainly seen z-scores. In truth, that's all this recipe is about: transforming our features from their endowed distribution into z-scores.
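As a minimal sketch of what a z-score is, the following computes one by hand for a small made-up list of values. The sample numbers and the helper name `z_scores` are illustrative assumptions, not part of the recipe. It uses the population standard deviation, which is also NumPy's default for `std`:

```python
import statistics

def z_scores(values):
    """Return the z-score of each value: (value - mean) / population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof=0)
    return [(v - mean) / std for v in values]

# Made-up sample values, purely for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scores = z_scores(sample)
print(scores)
```

After the transformation the values have mean 0 and standard deviation 1, whatever scale they started on.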


Getting ready

The act of scaling data is extremely useful. Many machine learning algorithms perform differently (and incorrectly) when the features exist at different scales. For example, SVMs perform poorly if the data isn't scaled because they use a distance function in their optimization, and that distance is biased if one feature varies from 0 to 10,000 and the other varies from 0 to 1.
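To see the bias concretely, here is a small sketch with three hypothetical samples whose first feature lives on a 0 to 10,000 scale and whose second lives on a 0 to 1 scale. The points, the helper names, and the choice to rescale by dividing by each feature's assumed range are all illustrative assumptions; dividing by the range is a simple stand-in for scaling, not the standardization this recipe performs:

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical samples: feature 0 spans 0..10,000, feature 1 spans 0..1.
a = (5000.0, 0.0)
b = (5010.0, 1.0)  # almost identical to a in feature 0, opposite extreme in feature 1
c = (6000.0, 0.0)  # identical to a in feature 1, 10% of the range away in feature 0

raw_ab = euclidean(a, b)  # dominated by feature 0; feature 1 barely registers
raw_ac = euclidean(a, c)

# Put both features on a comparable 0..1 scale by dividing by the assumed ranges.
ranges = (10000.0, 1.0)

def rescale(point):
    return tuple(v / r for v, r in zip(point, ranges))

scaled_ab = euclidean(rescale(a), rescale(b))
scaled_ac = euclidean(rescale(a), rescale(c))

print(raw_ab < raw_ac)        # b looks closer to a on the raw data
print(scaled_ab > scaled_ac)  # c is closer to a once both features count equally
```

On the raw data the second feature contributes almost nothing to the distance, so which neighbor counts as "nearest" flips once the features are put on a common scale.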


The preprocessing module contains several useful functions to scale features:

from sklearn import preprocessing
import numpy as np 

How to do it...

Continuing with the boston dataset, run the following commands:

X[:, :3].mean(axis=0)  # mean of the first three features
array([ 3.59376071, 11.36363636, 11.13677866])
X[:, :3].std(axis=0)  # standard deviation of the first three features
array([ 8.58828355, 23.29939569, 6.85357058])

There's actually a lot to learn from this initially. First, the first feature has the smallest mean but varies even more than the third feature. The second feature has the largest mean and standard deviation, so it takes the widest spread of values. Scaling the three columns gives:


X_2 = preprocessing.scale(X[:, :3])  # standardize the data
X_2.mean(axis=0)  # mean after scaling (zero up to floating-point error)
array([ 6.34099712e-17, -6.34319123e-16, -2.68291099e-15])
X_2.std(axis=0)  # standard deviation after scaling
array([ 1., 1., 1.])

How it works...

The centering and scaling function is extremely simple. It merely subtracts the mean and divides by the standard deviation, column by column:


z = (x - mean) / standard deviation
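The formula can be checked by hand with NumPy. The sketch below uses a synthetic array (an assumption; it is not the Boston data from the recipe) and the variable names are illustrative. It relies on NumPy's default population standard deviation (`ddof=0`), which to the best of my knowledge is also what scikit-learn's scaler uses:

```python
import numpy as np

# Synthetic stand-in for three feature columns on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[3.0, 11.0, 100.0], scale=[8.0, 23.0, 0.5], size=(200, 3))

# Column-wise standardization: subtract the mean, divide by the standard deviation.
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)  # population std (ddof=0), NumPy's default
X_scaled = (X_demo - col_mean) / col_std

print(X_scaled.mean(axis=0))  # each entry is zero up to floating-point error
print(X_scaled.std(axis=0))   # each entry is one up to floating-point error
```

The tiny nonzero means, like the 1e-17 values in the recipe's output, are rounding error rather than a real offset.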

In addition to the function, there is also a centering and scaling class that is easy to invoke, and this is particularly useful when used in conjunction with the Pipelines mentioned later.

The class is also useful because it persists what it learned: once fitted, it stores the means and standard deviations, so the same scaling can be applied again to other data:

my_scaler = preprocessing.StandardScaler()
my_scaler.fit(X[:, :3])
my_scaler.transform(X[:, :3]).mean(axis=0)
array([ 6.34099712e-17, -6.34319123e-16, -2.68291099e-15])
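A short sketch of why persistence matters: the scaler is fitted on one set of rows and then reused on rows it has never seen. The arrays here are synthetic and the names `X_train` and `X_new` are assumptions for illustration. The `mean_` and `scale_` attributes are what a fitted `StandardScaler` stores:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data: "training" rows and a few "new" rows from the same process.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[3.0, 11.0], scale=[8.0, 23.0], size=(100, 2))
X_new = rng.normal(loc=[3.0, 11.0], scale=[8.0, 23.0], size=(5, 2))

scaler = StandardScaler()
scaler.fit(X_train)  # learns mean_ and scale_ from the training rows only

# The stored statistics are reused; nothing is re-estimated from X_new.
X_new_scaled = scaler.transform(X_new)

# The same result computed by hand from the stored attributes.
manual = (X_new - scaler.mean_) / scaler.scale_
print(np.allclose(X_new_scaled, manual))
```

Because the new rows are scaled with the training statistics, their mean will generally not be exactly zero, which is the intended behavior when the scaler is applied to held-out data.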

Scaling features to mean 0 and standard deviation 1 isn't the only useful type of scaling. The preprocessing module also contains a MinMaxScaler class, which will scale the data within a certain range:


my_minmax_scaler = preprocessing.MinMaxScaler()
my_minmax_scaler.fit(X[:, :3])
my_minmax_scaler.transform(X[:, :3]).max(axis=0)
array([ 1., 1., 1.])

It's very simple to change the minimum and maximum values of the MinMaxScaler class from their defaults of 0 and 1, respectively:


my_odd_scaler = preprocessing.MinMaxScaler(feature_range=(-3.14,3.14))
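The arithmetic behind a custom range is a linear map of each column, first onto [0, 1] and then onto the requested interval. The helper below is a hand-written NumPy sketch of that idea, not scikit-learn's code; the function name `min_max_scale` and the small demo array are assumptions:

```python
import numpy as np

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Rescale each column of X linearly onto feature_range.

    Illustrative only: a constant column would divide by zero here.
    """
    lo, hi = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    unit = (X - col_min) / (col_max - col_min)  # each column now spans [0, 1]
    return unit * (hi - lo) + lo                # stretch and shift to [lo, hi]

# Small made-up array with two columns on different scales.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 30.0],
                   [4.0, 20.0]])

X_odd = min_max_scale(X_demo, feature_range=(-3.14, 3.14))
print(X_odd.min(axis=0), X_odd.max(axis=0))
```

Each column's smallest value lands on the lower bound and its largest on the upper bound, with everything else placed proportionally in between.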

Another option is normalization, which scales each sample (row) to have a length of 1. This differs from the scaling done previously, where the features (columns) were scaled. Normalization is illustrated in the following command:


normalized_X = preprocessing.normalize(X[:, :3])

If it's not apparent why this is useful, consider the Euclidean distance (a measure of similarity) between three samples: the first has the values (1, 1, 0), the second (3, 3, 0), and the third (1, -1, 0). The distance between the first and third vectors (2) is less than the distance between the first and second (about 2.83), even though the first and third are orthogonal, whereas the first and second differ only by a scalar factor of 3. Since distances are often used as measures of similarity, not normalizing the data first can be misleading.
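The three-sample example can be worked through numerically. The sketch below normalizes each row to unit Euclidean length by hand with NumPy, which is my understanding of what `normalize` does by default; the helper name `dist` is an assumption for illustration:

```python
import numpy as np

samples = np.array([[1.0, 1.0, 0.0],
                    [3.0, 3.0, 0.0],
                    [1.0, -1.0, 0.0]])

def dist(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))

# Raw distances: the orthogonal pair looks closer than the parallel pair.
raw_12 = dist(samples[0], samples[1])  # sqrt(8), about 2.83
raw_13 = dist(samples[0], samples[2])  # 2.0

# Scale every row to unit length (divide by its Euclidean norm).
unit = samples / np.linalg.norm(samples, axis=1, keepdims=True)

# After normalization only direction matters.
unit_12 = dist(unit[0], unit[1])  # essentially 0: same direction
unit_13 = dist(unit[0], unit[2])  # sqrt(2): orthogonal unit vectors

print(raw_12, raw_13)
print(unit_12, unit_13)
```

Once every sample has length 1, the first and second samples coincide and the orthogonal pair is the one that ends up far apart.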

There's more...

Scaling is a very deep subject. Here are a few things to consider when using scikit-learn's implementation.


Creating idempotent scaler objects

It is possible to turn off the centering of the mean and/or the scaling of the variance in a StandardScaler instance. For instance, it's possible (though not useful) to create a StandardScaler instance that simply performs the identity transformation:


my_useless_scaler = preprocessing.StandardScaler(with_mean=False, with_std=False)  # both steps can be turned off
transformed_sd = my_useless_scaler.fit_transform(X[:, :3]).std(axis=0)
original_sd = X[:, :3].std(axis=0)
np.array_equal(transformed_sd, original_sd)

Handling sparse matrices

Sparse matrices cannot be scaled in quite the same way as dense matrices. To mean center the data, the 0s would have to be altered to nonzero values, and the matrix would no longer be sparse, so scikit-learn refuses to center a sparse matrix:


import scipy.sparse
matrix = scipy.sparse.eye(1000)
preprocessing.scale(matrix)
ValueError: Cannot center sparse matrices: pass 'with_mean=False' instead
See docstring for motivation and alternatives.

As noted in the error, it is possible to scale a sparse matrix with with_std only:

preprocessing.scale(matrix, with_mean=False)
<1000x1000 sparse matrix of type '<type 'numpy.float64'>'
with 1000 stored elements in Compressed Sparse Row format>

The other option is to call todense() on the matrix. However, this is dangerous: the matrix is stored as sparse for a reason, and converting it to a dense array can cause a memory error.
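To make the sparsity argument concrete, the sketch below uses a small dense NumPy identity matrix as a stand-in for a sparse one (an assumption made so that the nonzero entries are easy to count) and compares the number of nonzero entries after centering and after scaling without centering:

```python
import numpy as np

n = 100
dense_eye = np.eye(n)  # dense stand-in for a sparse identity matrix
nonzero_before = np.count_nonzero(dense_eye)

# Mean centering: every column mean is 1/n, so every 0 becomes -1/n.
centered = dense_eye - dense_eye.mean(axis=0)
nonzero_after_centering = np.count_nonzero(centered)

# Scaling without centering: zeros divided by the column std stay zero.
scaled_only = dense_eye / dense_eye.std(axis=0)
nonzero_after_scaling = np.count_nonzero(scaled_only)

print(nonzero_before, nonzero_after_centering, nonzero_after_scaling)
```

Centering fills in all n * n entries, while dividing by the standard deviation leaves the zero pattern untouched, which is why only the latter is safe for sparse input.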

