Now it's time to take the math up a level! Principal component analysis (PCA) is the first somewhat advanced technique discussed in this book. While everything else thus far has been simple statistics, PCA combines statistics and linear algebra to produce a preprocessing step that can help reduce dimensionality, which can be the enemy of a simple model.
Getting ready
PCA is a member of the decomposition module of scikit-learn. There are several other decomposition methods available, which will be covered later in this recipe. Let's use the iris dataset, but it's better if you use your own data:
from sklearn import datasets
iris = datasets.load_iris()
iris_X = iris.data
How to do it...
First, import the decomposition module:
from sklearn import decomposition
Next, instantiate a default PCA object:
pca = decomposition.PCA()
pca
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
Compared to other objects in scikit-learn, PCA takes relatively few arguments. Now that the PCA object is created, simply transform the data by calling the fit_transform method, with iris_X as the argument:
iris_pca = pca.fit_transform(iris_X)
iris_pca[:5]
array([[ -2.684e+00, -3.266e-01,  2.151e-02,  1.006e-03],
       [ -2.715e+00,  1.696e-01,  2.035e-01,  9.960e-02],
       [ -2.890e+00,  1.373e-01, -2.471e-02,  1.930e-02],
       [ -2.746e+00,  3.111e-01, -3.767e-02, -7.596e-02],
       [ -2.729e+00, -3.339e-01, -9.623e-02, -6.313e-02]])
Now that the PCA has been fit, we can see how well it has done at explaining the variance (explained in the following How it works... section):
pca.explained_variance_ratio_
array([ 0.925, 0.053, 0.017, 0.005])
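Not part of the original recipe, but a quick sanity check worth doing here: the cumulative sum of these ratios shows how much variance each prefix of components captures, which is what guides the choice of n_components later on.

pca.explained_variance_ratio_.cumsum()
array([ 0.925, 0.978, 0.995, 1.   ])

(The cumulative values shown are computed from the rounded ratios above, so they are approximate.)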
How it works...
PCA has a general mathematical definition and a specific use case in data analysis. PCA finds the set of orthogonal directions that best represent the original data matrix.
Generally, PCA works by mapping the original dataset into a new space where the new column vectors of the matrix are each orthogonal. From a data analysis perspective, PCA transforms the covariance matrix of the data into column vectors that can "explain" certain percentages of the variance. For example, with the iris dataset, 92.5 percent of the variance of the overall dataset can be explained by the first component.
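A minimal sketch (not from the book) that verifies this claim numerically, assuming NumPy is available: the per-component explained variances are the eigenvalues of the data's sample covariance matrix, and the reported ratios are those eigenvalues normalized to sum to one.

import numpy as np
from sklearn import datasets, decomposition

iris_X = datasets.load_iris().data
pca = decomposition.PCA().fit(iris_X)

# Eigenvalues of the sample covariance matrix come out ascending from
# eigvalsh; reverse them so the largest comes first, matching PCA's
# component ordering
cov = np.cov(iris_X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

print(np.allclose(pca.explained_variance_, eigenvalues))                             # True
print(np.allclose(pca.explained_variance_ratio_, eigenvalues / eigenvalues.sum()))   # True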
This is extremely useful because dimensionality is problematic in data analysis. Quite often, algorithms applied to high-dimensional datasets will overfit on the initial training, and thus lose generality on the test set. If most of the underlying structure of the data can be faithfully represented by fewer dimensions, then it's generally considered a worthwhile trade-off. To demonstrate this, we'll apply the PCA transformation to the iris dataset and keep only two dimensions. The iris dataset can normally be separated quite well using all the dimensions:
pca = decomposition.PCA(n_components=2)
iris_X_prime = pca.fit_transform(iris_X)
iris_X_prime.shape
(150, 2)
Our data matrix is now 150 x 2, instead of 150 x 4.
The usefulness of two dimensions is that the data is now very easy to plot, as the sketch below shows.
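A hedged sketch of that plot (not part of the original recipe; it assumes matplotlib is installed and colors points by the iris species labels):

import matplotlib.pyplot as plt
from sklearn import datasets, decomposition

iris = datasets.load_iris()
# Project down to the first two principal components
iris_X_prime = decomposition.PCA(n_components=2).fit_transform(iris.data)

# One point per flower, colored by species; the three classes should
# form visibly separable clusters
plt.scatter(iris_X_prime[:, 0], iris_X_prime[:, 1], c=iris.target)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()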
The separability of the classes remains even after reducing the dimensionality to two. We can see how much of the variance is represented by the two components that remain:
pca.explained_variance_ratio_.sum()
0.977685206318795
There's more...
The PCA object can also be created with the amount of explained variance in mind from the start. For example, if we want to be able to explain at least 98 percent of the variance, the PCA object will be created as follows:
pca = decomposition.PCA(n_components=.98)
iris_X_prime = pca.fit_transform(iris_X)
pca.explained_variance_ratio_.sum()
1.0
Since we wanted to explain slightly more variance than in the two-component example, a third component was included.
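One more small check, not in the original text: with a float n_components, the number of components actually kept is only known after fitting, and the fitted object records it (the exact count and sum can vary with the scikit-learn version).

pca = decomposition.PCA(n_components=.98)
pca.fit(iris_X)
print(pca.n_components_)                    # number of components kept to reach 98%
print(pca.explained_variance_ratio_.sum())  # cumulative variance explained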
Explained variance
Explained variance is calculated from the variance of the errors:
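In its usual formulation (stated here as the standard definition, an assumption rather than a quote from this book), for an observed signal $y$ and its reconstruction $\hat{y}$:

$$\text{explained variance} = 1 - \frac{\operatorname{Var}(y - \hat{y})}{\operatorname{Var}(y)}$$

In the PCA context above, the ratio reported for each component is equivalently its covariance-matrix eigenvalue over the total, $\lambda_i \big/ \sum_j \lambda_j$.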