Assessing cluster correctness
We talked a little bit about assessing clusters when the ground truth is not known. However, we have not yet talked about assessing KMeans when the ground truth is known. In many cases, it isn't knowable; however, if there is outside annotation, we will sometimes know the ground truth, or at least a proxy for it.
Getting ready
So, let's assume a world where we have some outside agent supplying us with the ground truth.
We'll create a simple dataset, evaluate the measures of correctness against the ground truth in several ways, and then discuss them:
from sklearn import datasets
from sklearn import cluster
import matplotlib.pyplot as plt  # needed for the plots below
import numpy as np  # needed for np.diag later on
blobs, ground_truth = datasets.make_blobs(1000, centers=3, cluster_std=1.75)
How to do it...
Before we walk through the metrics, let's take a look at the dataset:
f, ax = plt.subplots(figsize=(7, 5))
colors = ['r', 'g', 'b']
for i in range(3):
    p = blobs[ground_truth == i]
    ax.scatter(p[:, 0], p[:, 1], c=colors[i], label="Cluster {}".format(i))
ax.set_title("Cluster With Ground Truth")
ax.legend()
f.savefig("9485OS_03-16")
The following is the output:
In order to fit a KMeans model, we'll create a KMeans object from the cluster module:
kmeans = cluster.KMeans(n_clusters=3)
kmeans.fit(blobs)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
kmeans.cluster_centers_
array([[-2.02602833, 6.15144638],
[ 8.35219959, -8.36463419],
[ 3.32319765, -9.96877357]])
Now that we've fit the model, let's have a look at the cluster centroids:
f, ax = plt.subplots(figsize=(7, 5))
colors = ['r', 'g', 'b']
for i in range(3):
    p = blobs[ground_truth == i]
    ax.scatter(p[:, 0], p[:, 1], c=colors[i], label="Cluster {}".format(i))
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           s=100, color='black', label='Centers')
ax.set_title("Cluster With Ground Truth")
ax.legend()
f.savefig("9485OS_03-17")
The following is the output:
Since we can view the clustering performance as a classification exercise, the metrics that are useful in a classification context are also useful here:
for i in range(3):
    print((kmeans.labels_ == ground_truth)[ground_truth == i].astype(int).mean())
1.0
0.057057057057057055
0.10810810810810811
Clearly, we have some backward clusters; KMeans assigns its label indices arbitrarily, so they need not line up with the ground truth, and the particular permutation will differ from run to run. So, let's get this straightened out first, and then we'll look at the accuracy:
new_ground_truth = ground_truth.copy()
new_ground_truth[ground_truth == 0] = 2
new_ground_truth[ground_truth == 2] = 0
for i in range(3):
    print((kmeans.labels_ == new_ground_truth)[ground_truth == i].astype(int).mean())
0.0
0.057057057057057055
0.0
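Swapping labels by hand is tedious, and the permutation needed will differ from run to run. As a sketch that is not part of the original recipe, the matching can be automated with the Hungarian algorithm, assuming SciPy is available:
# Hypothetical addition: find the label permutation that maximizes agreement
# between kmeans.labels_ and ground_truth, then score overall accuracy.
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(ground_truth, kmeans.labels_)
rows, cols = linear_sum_assignment(-cm)  # negate to maximize matched counts
mapping = dict(zip(cols, rows))  # KMeans label -> ground truth label
relabeled = np.array([mapping[label] for label in kmeans.labels_])
print((relabeled == ground_truth).mean())  # overall accuracy after matching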
So, once the labels are matched up, we're roughly correct 90 percent of the time. The second measure of similarity we'll look at is the mutual information score:
from sklearn import metrics
metrics.normalized_mutual_info_score(ground_truth, kmeans.labels_)
0.8293917499613002
If the score is close to 0, the label assignments were probably not generated through similar processes; a score closer to 1 means that there is a large amount of agreement between the two labelings.
For example, let's look at what happens when we score the ground truth against itself:
metrics.normalized_mutual_info_score(ground_truth, ground_truth)
1.0
Given the name, we can tell that there is probably an unnormalized mutual_info_score:
metrics.mutual_info_score(ground_truth, kmeans.labels_)
0.9108190106264438
These are very close; normalized mutual information is simply the mutual information divided by the square root of the product of the entropies of the ground truth and the assigned labels.
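To make the definition concrete, here is a minimal sketch that recomputes the normalized score by hand, assuming the square-root-of-entropy-product convention just described (newer scikit-learn versions also let you pick the normalization through an average_method argument):
# Recompute NMI by hand: MI divided by sqrt(H(truth) * H(labels)).
def label_entropy(labels):
    # Shannon entropy (in nats) of a discrete label assignment.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

mi = metrics.mutual_info_score(ground_truth, kmeans.labels_)
mi / np.sqrt(label_entropy(ground_truth) * label_entropy(kmeans.labels_))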
There's more...
One cluster metric we haven't talked about yet, and one that is not reliant on the ground truth, is inertia. It is not very well documented as a metric at the moment; however, it is the quantity that KMeans minimizes. Inertia is the sum of the squared distances between each point and its assigned cluster center. We can use a little NumPy to verify this against the fitted attribute:
kmeans.inertia_
5933.023391165296
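To verify this with a little NumPy, as promised, here is a hand-rolled version; it should agree with the attribute above up to floating-point error:
# Inertia by hand: sum of squared distances from each point
# to its assigned cluster center.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
((blobs - assigned_centers) ** 2).sum()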
Using MiniBatch KMeans to handle more data
KMeans is a nice method to use; however, it is not ideal for a lot of data, due to its computational complexity. That said, we can get approximate solutions with much better algorithmic complexity using MiniBatch KMeans.
Getting ready
MiniBatch KMeans is a faster implementation of KMeans. KMeans is computationally very expensive; the problem is NP-hard. However, using MiniBatch KMeans, we can speed up KMeans by orders of magnitude. This is achieved by taking many subsamples that are called MiniBatches. Given the convergence properties of subsampling, a close approximation to regular KMeans is achieved, given good initial conditions.
How to do it...
Let's do some very high-level profiling of MiniBatch clustering. First, we'll look at the overall speed difference, and then we'll look at the errors in the estimates:
from sklearn.datasets import make_blobs
blobs, labels = make_blobs(int(1e6), 3)
from sklearn.cluster import KMeans, MiniBatchKMeans
kmeans = KMeans(n_clusters=3)
minibatch = MiniBatchKMeans(n_clusters=3)
Understand that these metrics are meant to expose the issue; therefore, great care is taken to ensure the highest accuracy of the benchmarks. There is a lot of information available on this topic; if you really want to get to the heart of why MiniBatch KMeans is better at scaling, it will be a good idea to review what's available.
Now that the setup is complete, we can measure the time difference:
%time kmeans.fit(blobs)  # IPython magic
CPU times: user 6.19 s, sys: 328 ms, total: 6.52 s
Wall time: 3.32 s
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
%time minibatch.fit(blobs)
CPU times: user 3.18 s, sys: 28.8 ms, total: 3.21 s
Wall time: 3.04 s
MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++',
init_size=None, max_iter=100, max_no_improvement=10,
n_clusters=3, n_init=3, random_state=None,
reassignment_ratio=0.01, tol=0.0, verbose=0)
There's a large difference in CPU times. The difference in clustering performance is shown as follows:
kmeans.cluster_centers_[0]
array([ 1.10522173, -5.59610761, -8.35565134])
minibatch.cluster_centers_[0]
array([ 1.12071187, -5.61215116, -8.32015587])
The next question we might ask is how far apart the centers are:
from sklearn.metrics import pairwise
pairwise.pairwise_distances(kmeans.cluster_centers_[0].reshape(1, -1),
                            minibatch.cluster_centers_[0].reshape(1, -1))
array([[ 0.03305309]])
This seems to be very close. The diagonal of the pairwise distance matrix between the two sets of centers will contain the per-cluster center differences; bear in mind that the two models may number their clusters differently, in which case some diagonal entries will be large even though matching centers exist:
np.diag(pairwise.pairwise_distances(kmeans.cluster_centers_, minibatch.cluster_centers_))
array([5.11837284, 0.02854318, 5.08491333])
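The two large diagonal entries are a symptom of exactly that: in this run, the MiniBatch model numbered its clusters differently than the KMeans model did. As a quick addition to the recipe, matching each KMeans center to its nearest MiniBatch center shows how close the two solutions really are:
# Distance from each KMeans center to the closest MiniBatch center,
# regardless of how the two models numbered their clusters.
d = pairwise.pairwise_distances(kmeans.cluster_centers_,
                                minibatch.cluster_centers_)
d.min(axis=1)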
How it works...
The batches here are key. Batches are iterated through to find the batch mean; on each iteration, the prior centers are updated in relation to the current batch mean. There are several options that dictate the general KMeans behavior, and parameters that determine how MiniBatch KMeans gets updated.
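As a rough illustration of the idea, here is a toy sketch in the spirit of the web-scale k-means update; this is a simplification under stated assumptions, not scikit-learn's actual implementation. Each center keeps a count of the points it has absorbed and moves toward each new batch with a shrinking per-center learning rate.
# Toy sketch of a per-batch update, not scikit-learn's real code:
# assign batch points to their nearest centers, then nudge each
# center toward its points with learning rate 1 / count.
def minibatch_update(centers, counts, batch):
    dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignments = dists.argmin(axis=1)
    for point, c in zip(batch, assignments):
        counts[c] += 1
        eta = 1.0 / counts[c]
        centers[c] = (1 - eta) * centers[c] + eta * point
    return centers, counts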
The batch_size parameter determines how large the batches should be. Just for fun, let's run MiniBatch again; this time, we set the batch size to the same size as the dataset:
minibatch = MiniBatchKMeans(batch_size=len(blobs))
%time minibatch.fit(blobs)
CPU times: user 34.6 s, sys: 3.17 s, total: 37.8 s
Wall time: 44.6 s
Clearly, this is against the spirit of the problem, but it does illustrate an important point. Choosing poor initial conditions can affect how well models, particularly clustering models, converge. With MiniBatch KMeans, there is no guarantee that the global optimum will be achieved.
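As a quick extra check that is not in the original recipe, refitting with a few different seeds and comparing the final inertia makes the point directly; the runs will generally not agree exactly:
# Initialization sensitivity: different seeds can land on different optima.
for seed in (0, 1, 2):
    mb = MiniBatchKMeans(n_clusters=3, random_state=seed).fit(blobs)
    print(seed, mb.inertia_)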