Scaling data to the standard normal

2020-04-20 10:14:17

A preprocessing step that is almost always recommended is to scale columns to the standard normal. The standard normal is probably the most important distribution in all of statistics.


If you've ever been introduced to statistics, you've almost certainly seen z-scores. In truth, that's all this recipe is about: transforming our features from their endowed distribution into z-scores.


Getting ready

The act of scaling data is extremely useful. Many machine learning algorithms perform differently (and incorrectly) when features exist at different scales. For example, SVMs perform poorly if the data isn't scaled, because they use a distance function in their optimization, which is biased if one feature varies from 0 to 10,000 and another varies from 0 to 1.

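To make that bias concrete, here is a minimal sketch (the feature values are invented for illustration) of how a single large-scale feature dominates a Euclidean distance:

import numpy as np
a = np.array([5000.0, 0.2])  # hypothetical sample: feature 1 spans 0-10,000, feature 2 spans 0-1
b = np.array([6000.0, 0.9])
np.sqrt(((a - b) ** 2).sum())  # ≈ 1000.0; feature 2's difference is invisible in the distance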

The preprocessing module contains several useful functions to scale features:

from sklearn import preprocessing, datasets
import numpy as np
X = datasets.load_boston().data  # the Boston housing data this recipe continues with

How to do it...

Continuing with the Boston dataset, run the following commands:

X[:, :3].mean(axis=0)  # mean of the first three features
array([ 3.59376071, 11.36363636, 11.13677866])
X[:, :3].std(axis=0)  # standard deviation of the first three features
array([ 8.58828355, 23.29939569, 6.85357058])

There's actually a lot to learn from this initially. Firstly, the first feature has the smallest mean but varies even more than the third feature. The second feature has the largest mean and standard deviation—it takes the widest spread of values:


X_2 = preprocessing.scale(X[:, :3])  # center and scale the first three features
X_2.mean(axis=0)  # means after scaling are effectively zero
array([ 6.34099712e-17, -6.34319123e-16, -2.68291099e-15])
X_2.std(axis=0)  # standard deviations after scaling
array([ 1., 1., 1.])

How it works...

The center and scaling function is extremely simple. It merely subtracts the mean and divides by the standard deviation:


z = (x - mean(x)) / std(x)
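
For example, computing the z-scores by hand with NumPy matches preprocessing.scale (a quick sanity check on the same columns):

manual = (X[:, :3] - X[:, :3].mean(axis=0)) / X[:, :3].std(axis=0)  # subtract the mean, divide by the std
np.allclose(manual, preprocessing.scale(X[:, :3]))
True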

In addition to a function, there is also a center and scaling class that is easy to invoke, and this is particularly useful when used in conjunction with the Pipelines mentioned later.


It's also useful for the center and scaling class to persist across individual scalings:

my_scaler = preprocessing.StandardScaler()
my_scaler.fit(X[:, :3])  # learn the mean and standard deviation of each feature
my_scaler.transform(X[:, :3]).mean(axis=0)  # apply the fitted statistics
array([ 6.34099712e-17, -6.34319123e-16, -2.68291099e-15])
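
That persistence is the point: the statistics learned from one batch of data can be reused to transform later data on exactly the same scale. A minimal sketch (the "new" rows here are just the first ten rows of X, reused for illustration):

new_rows = X[:10, :3]  # pretend these rows arrive after fitting
my_scaler.transform(new_rows)  # reuses the fitted mean/std, so the scaling stays consistent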

Scaling features to mean 0 and standard deviation 1 isn't the only useful type of scaling. The preprocessing module also contains a MinMaxScaler class, which will scale the data within a certain range:


my_minmax_scaler = preprocessing.MinMaxScaler()
my_minmax_scaler.fit(X[:, :3])  # learn each feature's minimum and maximum
my_minmax_scaler.transform(X[:, :3]).max(axis=0)  # every feature now tops out at 1
array([ 1., 1., 1.])

It's very simple to change the minimum and maximum values of the MinMaxScaler class from its defaults of 0 and 1, respectively:


my_odd_scaler = preprocessing.MinMaxScaler(feature_range=(-3.14, 3.14))
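
A quick check (a hypothetical follow-up, not from the original recipe) confirms the new bounds:

scaled = my_odd_scaler.fit_transform(X[:, :3])
scaled.min(axis=0), scaled.max(axis=0)  # each feature now runs from -3.14 to 3.14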

Furthermore, another option is normalization. This will scale each sample to have a length of 1. This is different from the other types of scaling done previously, where the features were scaled. Normalization is illustrated in the following command:


normalized_X = preprocessing.normalize(X[:, :3])  # scale each row (sample) to unit norm

If it's not apparent why this is useful, consider the Euclidean distance (a measure of similarity) between three samples, where one sample has the values (1, 1, 0), another has (3, 3, 0), and the final has (1, -1, 0). The distance between the 1st and 3rd vectors is less than the distance between the 1st and 2nd, even though the 1st and 3rd are orthogonal, whereas the 1st and 2nd differ only by a scalar factor of 3. Since distances are often used as measures of similarity, not normalizing the data first can be misleading.

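A short sketch of that example, using the vectors given above:

v1 = np.array([[1.0, 1.0, 0.0]])
v2 = np.array([[3.0, 3.0, 0.0]])
v3 = np.array([[1.0, -1.0, 0.0]])
np.linalg.norm(v1 - v2), np.linalg.norm(v1 - v3)  # ≈ (2.83, 2.0): v3 looks closer to v1
n1, n2, n3 = (preprocessing.normalize(v) for v in (v1, v2, v3))
np.linalg.norm(n1 - n2), np.linalg.norm(n1 - n3)  # ≈ (0.0, 1.41): normalized, v2 coincides with v1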

There's more...

Scaling is a very deep subject. Here are a few things to consider when using scikit-learn's implementation.


Creating idempotent scaler objects

It is possible to toggle whether the mean and/or the standard deviation are used in a StandardScaler instance. For instance, it's possible (though not useful) to create a StandardScaler instance that simply performs the identity transformation:


my_useless_scaler = preprocessing.StandardScaler(with_mean=False, with_std=False)  # both steps disabled
transformed_sd = my_useless_scaler.fit_transform(X[:, :3]).std(axis=0)
original_sd = X[:, :3].std(axis=0)
np.array_equal(transformed_sd, original_sd)
True

Handling sparse matrices

Sparse matrices require special handling when scaling. Mean-centering the data would alter its 0s to nonzero values, and the matrix would no longer be sparse:


import scipy.sparse
matrix = scipy.sparse.eye(1000)  # a 1000 x 1000 sparse identity matrix
preprocessing.scale(matrix)
…
ValueError: Cannot center sparse matrices: pass 'with_mean=False' instead
See docstring for motivation and alternatives.

As noted in the error, it is possible to scale a sparse matrix with with_std only:

preprocessing.scale(matrix, with_mean=False)
<1000x1000 sparse matrix of type '<type 'numpy.float64'>'
with 1000 stored elements in Compressed Sparse Row format>

The other option is to call todense() on the array. However, this is dangerous because the matrix is already sparse for a reason, and it will potentially cause a memory error.

