Probabilistic clustering with Gaussian Mixture Models
In KMeans, we assume that the variance of the clusters is equal. This leads to a subdivision of space that determines how the clusters are assigned; but what about a situation where the variances are not equal, and each point has a probabilistic association with each cluster?
Getting ready
There's a more probabilistic way of looking at KMeans clustering. Hard KMeans clustering is equivalent to applying a Gaussian Mixture Model with a covariance matrix, S, that can be factored as the error variance times the identity matrix, that is, S = σ²I. This covariance structure is identical for every cluster, which leads to spherical clusters.
However, if we allow S to vary, a GMM can be estimated and used for prediction. We'll look at how this works in a univariate sense, and then expand to more dimensions.
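To make this concrete, scikit-learn exposes exactly this choice through the covariance_type parameter of GaussianMixture. Here is a minimal sketch (the toy data is invented for illustration): 'spherical' corresponds to the S = σ²I, KMeans-like case, while 'full' lets each cluster's covariance vary freely.

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two blobs with very different spreads (invented for illustration)
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 3, (50, 2))])

# S = sigma^2 * I for every cluster: the hard-KMeans-like assumption
spherical = GaussianMixture(n_components=2, covariance_type='spherical').fit(X_toy)
# Each cluster gets its own unrestricted covariance matrix S
full = GaussianMixture(n_components=2, covariance_type='full').fit(X_toy)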
How to do it...
First, we need to create some data. For example, let's simulate heights of both women and men. We'll use this example throughout this recipe. It's a simple example, but hopefully it will illustrate what we're trying to accomplish in an N-dimensional space, which is a little easier to visualize:
import numpy as np

N = 1000
in_m = 72    # mean height of men, in inches
in_w = 66    # mean height of women, in inches
s_m = 2      # standard deviation for men
s_w = s_m    # equal variances to start with
m = np.random.normal(in_m, s_m, N)
w = np.random.normal(in_w, s_w, N)

from matplotlib import pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
ax.set_title("Histogram of Heights")
ax.hist(m, alpha=.5, label="Men");
ax.hist(w, alpha=.5, label="Women");
ax.legend()
The following is the output:
Next, we might be interested in subsampling each group, fitting the distributions, and then predicting on the remaining data:
# Randomly assign each observation to the test (True) or training (False) set
random_sample = np.random.choice([True, False], size=m.size)
m_test = m[random_sample]
m_train = m[~random_sample]
w_test = w[random_sample]
w_train = w[~random_sample]
Now we need to get the empirical distribution of the heights of both men and women based on the training set:
from scipy import stats

# Freeze normal distributions with the training means and standard deviations
m_pdf = stats.norm(m_train.mean(), m_train.std())
w_pdf = stats.norm(w_train.mean(), w_train.std())
For the test set, we will calculate the likelihood that each data point was generated from either distribution, and the more likely distribution will get the appropriate label assigned. We will, of course, look at how accurate we were:
m_pdf.pdf(m[0])
0.11686474914470572
w_pdf.pdf(m[0])
0.04121183637930949
Notice the difference in likelihoods.
Assume that we guess man whenever the men's likelihood is higher, but overwrite that guess with woman if the women's likelihood is higher:
# Start by guessing "man" (1) for every test point...
guesses_m = np.ones_like(m_test)
# ...then flip to "woman" (0) wherever the women's likelihood is higher
guesses_m[m_pdf.pdf(m_test) < w_pdf.pdf(m_test)] = 0
Obviously, the question is how accurate we are. Since guesses_m will be 1 if we are correct and 0 if we aren't, we take the mean of the vector to get the accuracy:
guesses_m.mean()
0.927536231884058
Not too bad! Now, to see how well we did with the women's group, use the following commands:
guesses_w = np.ones_like(w_test)
guesses_w[m_pdf.pdf(w_test) > w_pdf.pdf(w_test)] = 0
guesses_w.mean()
0.927536231884058
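Since both guess vectors store 1 for a correct guess and 0 for an incorrect one, we can also pool them to get a single overall accuracy across the two groups (a small sketch, not part of the original recipe):

# Overall accuracy: pool the correctness indicators from both groups
np.concatenate([guesses_m, guesses_w]).mean()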
Let's allow the variance to differ between groups. First, create some new data:
s_m = 1    # tighter spread for men
s_w = 4    # much wider spread for women
m = np.random.normal(in_m, s_m, N)
w = np.random.normal(in_w, s_w, N)
Then, create a training set:
m_test = m[random_sample]
m_train = m[~random_sample]
w_test = w[random_sample]
w_train = w[~random_sample]
f, ax = plt.subplots(figsize=(7, 5))
ax.set_title("Histogram of Heights")
ax.hist(m_train, alpha=.5, label="Men");
ax.hist(w_train, alpha=.5, label="Women");
ax.legend()
Let's take a look at the difference in variances between the men and women:
Now we can create the same PDFs:
m_pdf = stats.norm(m_train.mean(), m_train.std())
w_pdf = stats.norm(w_train.mean(), w_train.std())
The following is the output:
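To see the unequal variances directly in the fitted densities, here is a quick plotting sketch (the height range is an assumption chosen for illustration; this plot is not part of the original recipe):

# Evaluate both frozen normal PDFs over a plausible range of heights (inches)
x = np.linspace(55, 85, 300)
f, ax = plt.subplots(figsize=(7, 5))
ax.set_title("Fitted Normal PDFs")
ax.plot(x, m_pdf.pdf(x), label="Men")    # tall, narrow peak (s_m = 1)
ax.plot(x, w_pdf.pdf(x), label="Women")  # short, wide curve (s_w = 4)
ax.legend()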
You can imagine this in a multidimensional space:
# Two 2D Gaussian blobs with different means and spreads
class_A = np.random.normal(0, 1, size=(100, 2))
class_B = np.random.normal(4, 1.5, size=(100, 2))
f, ax = plt.subplots(figsize=(7, 5))
ax.scatter(class_A[:,0], class_A[:,1], label='A', c='r')
ax.scatter(class_B[:,0], class_B[:,1], label='B')
The following is the output:
How it works...
Okay, so now that we've looked at how we can classify points based on distribution, let's look at how we can do this in scikit-learn:
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2)
# Stack the two classes into one dataset, with labels for reference
X = np.row_stack((class_A, class_B))
y = np.hstack((np.ones(100), np.zeros(100)))
Since we're good little data scientists, we'll create a training set:
train = np.random.choice([True, False], 200)
gmm.fit(X[train])
GaussianMixture(covariance_type='full', init_params='kmeans', max_iter=100,
means_init=None, n_components=2, n_init=1, precisions_init=None,
random_state=None, reg_covar=1e-06, tol=0.001, verbose=0,
verbose_interval=10, warm_start=False, weights_init=None)
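Note that the default covariance_type, visible in the output above, is 'full'; this is exactly the "allow S to vary" setting discussed in the Getting ready section.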
Fitting and predicting are done in the same way as for many of the other objects in scikit-learn:
gmm.fit(X[train])
gmm.predict(X[train])[:5]
array([0, 0, 0, 0, 0])
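One caveat worth a sketch of its own: GaussianMixture is fit without ever seeing y, so the component indices it returns are arbitrary, and component 1 is not guaranteed to line up with our label 1. A hedged way to check held-out accuracy under that caveat (the flip-handling step is my addition, not part of the original recipe):

preds = gmm.predict(X[~train])
truth = y[~train]
# The mixture's component numbering is arbitrary, so score both the
# raw assignment and its flipped version and keep the better one
accuracy = max((preds == truth).mean(), (preds != truth).mean())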
There are other methods worth looking at now that the model has been fit. For example, score_samples gives the per-sample log-likelihood under the fitted mixture, and predict_proba gives the posterior probability that each sample came from each component.
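A quick sketch of both calls (the exact numbers will vary from run to run, so only the shapes are noted in the comments):

# Log-likelihood of each sample under the whole fitted mixture: shape (n_samples,)
gmm.score_samples(X[train])[:5]
# Posterior probability of each component per sample: shape (n_samples, 2)
gmm.predict_proba(X[train])[:5]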