In the last recipe, we looked at transforming our data into the standard normal distribution. Now, we'll talk about another transformation, one that is quite different.
Instead of working with the distribution to standardize it, we'll purposely throw away data. If we have good reason, this can be a very smart move: data that is ostensibly continuous often contains discontinuities that binary features can capture.
Getting ready
Creating binary features and outcomes is a very useful method, but it should be used with caution. Let's use the boston dataset to learn how to turn values into binary outcomes.
First, load the boston dataset:
from sklearn import datasets
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2, so this recipe needs an older release.
boston = datasets.load_boston()
How to do it...
Similar to scaling, there are two ways to binarize features in scikit-learn:
1. preprocessing.binarize (a function)
2. preprocessing.Binarizer (a class)
The boston dataset's target variable is the median house value, in thousands of dollars. This dataset is good for testing regression and other continuous predictors, but consider a situation where we simply want to predict whether a house's value is more than the overall mean. To do this, we use the mean as a threshold. If the value is greater than the mean, produce a 1; otherwise (including a value exactly equal to the mean), produce a 0:
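from sklearn import preprocessing
# binarize expects a 2D array, so the target is wrapped as a single row
new_target = preprocessing.binarize([boston.target], threshold=boston.target.mean())
# The result has shape (1, n_samples); take the first five entries of row 0
new_target[0, :5]
array([1., 0., 1., 1., 1.])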
This was easy, but let's check to make sure it worked correctly:
(boston.target[:5] > boston.target.mean()).astype(int)
array([1, 0, 1, 1, 1])
Given the simplicity of the operation in NumPy, it's fair to ask why you would want to use the built-in functionality of scikit-learn. Pipelines, covered in the Using Pipelines for multiple preprocessing steps recipe, go a long way toward answering this. In anticipation of that, let's use the Binarizer class:
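# Pass threshold by keyword; avoid the name bin, which shadows a Python built-in
binarizer = preprocessing.Binarizer(threshold=boston.target.mean())
# Transformers expect 2D input, so reshape the target into a single column
new_target = binarizer.fit_transform(boston.target.reshape(-1, 1))
new_target[:5].ravel()
array([1., 0., 1., 1., 1.])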
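As a rough preview of why the class form is convenient, here is a minimal sketch of a Binarizer used as a pipeline step. The synthetic data, the threshold of 5.0, the step names, and the choice of LogisticRegression are all assumptions made for illustration; none of them come from the recipe, and the sketch binarizes features rather than the target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

# Synthetic stand-in data (an assumption, not the boston dataset)
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] > 5.0).astype(int)

# The Binarizer runs first, then the classifier sees only 0/1 features
pipe = Pipeline([
    ("binarize", Binarizer(threshold=5.0)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

# Inspect what the first step produces on its own
X_binary = pipe.named_steps["binarize"].transform(X)
predictions = pipe.predict(X)
```

Because the threshold lives inside the pipeline object, the same binarization is applied again whenever predict is called on new data.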
How it works...
Hopefully, this is pretty obvious, but under the hood, scikit-learn creates a conditional mask that is True wherever a value in the array is greater than the threshold. It then sets the array to 1 where the condition is met and 0 where it is not.
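The masking step can be approximated in plain NumPy. This is a sketch of the idea rather than scikit-learn's actual source; the sample values and the threshold of 22.5 are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Hypothetical values; one of them is exactly equal to the threshold
values = np.array([[24.0, 21.6, 34.7, 22.5, 36.2]])
threshold = 22.5

# Conditional mask: True only where the value is strictly greater
mask = values > threshold

# 1 where the mask is True, 0 everywhere else
manual = np.where(mask, 1.0, 0.0)

# The library function should agree with the manual version
from_sklearn = binarize(values, threshold=threshold)
```

Note that the comparison is strict, so the value equal to the threshold maps to 0.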
There's more...
Let's also learn about sparse matrices and the fit method.
Sparse matrices
Sparse matrices are special in that zeros aren't stored, which saves memory. This creates an issue for the binarizer: with a negative threshold, every unstored zero would exceed the threshold and have to become a 1, which would make the matrix dense. To avoid this, the binarizer requires that the threshold for a sparse matrix not be less than zero:
# Import coo_matrix directly; scipy.sparse.coo is a private module
from scipy.sparse import coo_matrix
spar = coo_matrix(np.random.binomial(1, .25, 100))
preprocessing.binarize(spar, threshold=-1)
ValueError: Cannot binarize a sparse matrix with threshold < 0
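For contrast, a non-negative threshold works on sparse input and the result stays sparse. The small matrix and the threshold of 0.5 below are illustrative assumptions, not values from the recipe.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import binarize

# A small, mostly-zero matrix stored in CSR format (made-up values)
dense = np.array([[0.0, 0.3, 0.0, 2.0],
                  [1.5, 0.0, 0.0, 0.4]])
spar = sparse.csr_matrix(dense)

# Non-negative threshold: only the stored entries need to be compared
result = binarize(spar, threshold=0.5)

# Negative threshold: the unstored zeros would all have to become 1
raised = False
try:
    binarize(spar, threshold=-1)
except ValueError:
    raised = True
```

Only the entries above 0.5 survive as ones, so the output is at least as sparse as the input.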
The fit method
The fit method exists for the binarizer transformation, but it does not fit anything; it simply returns the object.
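A quick way to see this is to check that fit hands back the very same object. The input array and the threshold of 0.0 here are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Arbitrary 2D input for illustration
X = np.array([[1.0, -2.0],
              [0.5, 3.0]])

binarizer = Binarizer(threshold=0.0)

# fit learns nothing from X; it returns the binarizer itself
fitted = binarizer.fit(X)
same_object = fitted is binarizer

# The threshold set in the constructor is all that transform uses
X_binary = binarizer.transform(X)
```

The method exists so that Binarizer follows the same fit/transform interface as other transformers, which is what lets it slot into a pipeline.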