Finding the closest objects in the feature space

2020-04-24 12:02:13

Sometimes, the easiest thing to do is to just find the distance between two objects. We need to choose a distance metric, compute the pairwise distances, and compare the outcomes to what we expect.

Getting ready

A lower-level utility in scikit-learn is sklearn.metrics.pairwise. It contains several functions that compute the distances between the vectors in a matrix X, or between the vectors in X and the vectors in Y.
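To make the two calling styles concrete, here is a minimal sketch. The arrays X and Y are made-up toy data, not part of the original walkthrough.

```python
import numpy as np
from sklearn.metrics import pairwise

# Hypothetical toy data: 3 vectors in X and 2 vectors in Y, 2 features each.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
Y = np.array([[0.0, 0.0], [3.0, 4.0]])

# One argument: distances among the rows of X, shape (3, 3).
within_X = pairwise.pairwise_distances(X)

# Two arguments: distances from each row of X to each row of Y, shape (3, 2).
X_to_Y = pairwise.pairwise_distances(X, Y)

print(within_X.shape, X_to_Y.shape)
print(X_to_Y[0])  # distances from X[0] to each row of Y
```

With one argument the result is square and symmetric; with two arguments row i holds the distances from X[i] to every row of Y.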

This can be useful for information retrieval. For example, given a set of customers whose attributes are stored in X, we might want to take a reference customer and find the customers closest to it. In fact, we might want to rank all customers by similarity, as measured by a distance function. The quality of that similarity measure depends on the choice of feature space as well as on any transformation we apply to it.
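The point about transformations can be illustrated with a small sketch. The customer table below is invented for illustration, and standardizing with StandardScaler is just one possible transformation, not something the original recipe prescribes.

```python
import numpy as np
from sklearn.metrics import pairwise
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: columns are age (years) and annual income (dollars).
customers = np.array([
    [25.0, 50_000.0],  # index 0: reference customer
    [65.0, 50_500.0],  # index 1: very different age, nearly identical income
    [26.0, 53_000.0],  # index 2: nearly identical age, similar income
    [45.0, 90_000.0],  # index 3
    [50.0, 30_000.0],  # index 4
])

# Raw features: income dominates because its values are much larger.
raw = pairwise.pairwise_distances(customers[:1], customers)[0]
raw_order = np.argsort(raw)

# Standardized features: each column has zero mean and unit variance.
scaled_customers = StandardScaler().fit_transform(customers)
scaled = pairwise.pairwise_distances(scaled_customers[:1], scaled_customers)[0]
scaled_order = np.argsort(scaled)

print(raw_order)     # ranking in the raw feature space
print(scaled_order)  # ranking after standardization
```

On the raw features the nearest neighbor of the reference customer is customer 1, because only income matters at that scale. After standardization it is customer 2, who is close in both age and income.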

We'll walk through several different scenarios of measuring distance.

How to do it...

We will use the pairwise_distances function to determine the "closeness" of objects. Remember that closeness is just a notion of similarity, graded by whichever distance function we choose.

First, let's import the pairwise module from the metrics module, along with NumPy, and create a dataset to play with:

import numpy as np
from sklearn.metrics import pairwise
from sklearn.datasets import make_blobs
# make_blobs draws random points, so the values shown below vary from run to run.
points, labels = make_blobs()

The simplest way to check the distances is pairwise_distances:

distances = pairwise.pairwise_distances(points)

distances is an N x N matrix with 0s along the diagonal, because every point is at distance 0 from itself. As a quick check, let's look at the first few diagonal entries:

np.diag(distances)[:5]
array([ 0., 0., 0., 0., 0.])

Now we can look at the distances between the first point in points and the other points:

distances[0][:5]
array([ 0., 11.82643041,1.23751545, 1.17612135, 14.61927874])

Ranking the points by closeness is very easy with np.argsort:

ranks = np.argsort(distances[0])
ranks[:5]
array([ 0, 27, 98, 23, 67])

The great thing about argsort is that we can use its result to index the points matrix and get the actual points in order of closeness:

points[ranks][:5]
array([[ 2.31046218, -7.74549549],
       [ 2.0834554 , -8.65921727],
       [ 1.72362134, -6.86071402],
       [ 1.7144178 , -6.77084712],
       [ 1.31671351, -7.04678261]])
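The ranking steps above can be wrapped in a small helper. This is a sketch only: k_nearest is a hypothetical name, and random_state=0 is an assumption added so the sketch is reproducible.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise

# random_state is fixed here only to make the sketch reproducible.
points, labels = make_blobs(random_state=0)
distances = pairwise.pairwise_distances(points)


def k_nearest(distances, index, k):
    """Hypothetical helper: indices of the k points closest to `index`."""
    order = np.argsort(distances[index])
    # Drop the query point itself, which is at distance 0.
    order = order[order != index]
    return order[:k]


neighbors = k_nearest(distances, 0, 3)
print(neighbors)                # indices of the 3 nearest points
print(distances[0, neighbors])  # their distances, in non-decreasing order
print(points[neighbors])        # the actual coordinates
```

Filtering out the query index explicitly avoids relying on the query point always sorting first, which may not hold if another point is an exact duplicate.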

It's useful to see what the closest points look like. Beyond giving us some assurance, this confirms that the ranking works as intended.

How it works...

Given some distance function, the distance is measured between each pair of points. The default is the Euclidean distance, which for two vectors x and y with n components is as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

In words, this takes the difference between each pair of components of the two vectors, squares each difference, sums the squares, and then takes the square root. This should look familiar: we used something very similar when looking at the mean-squared error. If we take the square root of the mean-squared error, we get the root-mean-square error (RMSE), a commonly used metric, which is just this distance function applied to the predictions and the targets and divided by the square root of the number of components.
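The relationship between the Euclidean distance and RMSE can be checked numerically. The values of y_true and y_pred below are made up for illustration.

```python
import numpy as np

# Hypothetical targets and predictions, used only for illustration.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

n = len(y_true)
euclidean = np.sqrt(np.sum((y_true - y_pred) ** 2))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# RMSE is the Euclidean distance scaled by 1 / sqrt(n).
print(euclidean, rmse, euclidean / np.sqrt(n))
```

The last two printed values agree, so ranking by RMSE and ranking by Euclidean distance give the same order whenever the vectors have the same length.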

In Python, this looks like the following:

def euclid_distances(x, y):
    # Square the component-wise differences, sum them, and take the square root.
    return np.sqrt(np.sum((x - y) ** 2))

# The value depends on the randomly generated points.
euclid_distances(points[0], points[1])
4.8537804504870765
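As a sanity check, the same formula can be applied to all pairs at once with NumPy broadcasting and compared against pairwise_distances. This is a sketch; random_state=0 and the tolerance passed to np.allclose are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise

# random_state is fixed here only to make the sketch reproducible.
points, _ = make_blobs(random_state=0)

# Broadcasting: diffs[i, j] is points[i] - points[j], shape (N, N, n_features).
diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]
manual = np.sqrt((diffs ** 2).sum(axis=-1))

library = pairwise.pairwise_distances(points)

# Compare with a small absolute tolerance to allow for floating-point differences.
agree = np.allclose(manual, library, atol=1e-6)
print(agree)
```

The broadcast version builds an N x N x n_features intermediate array, so it is only practical for small datasets.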

scikit-learn implements several other distance functions itself, and it can also use the distance functions of SciPy. At the time of writing, the scikit-learn distance functions support sparse matrices. Check out the SciPy documentation for more information on the distance functions:

1. cityblock (city block distance)
2. cosine (cosine distance)
3. euclidean (Euclidean distance)
4. l1
5. l2
6. manhattan (Manhattan distance)
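Several of the names in this list refer to the same distance. A minimal sketch, using two made-up vectors a and b, shows which ones coincide:

```python
import numpy as np
from sklearn.metrics import pairwise

# Hypothetical vectors chosen so the results are easy to check by hand.
a = np.array([[1.0, 0.0]])
b = np.array([[4.0, 4.0]])

metrics = ["cityblock", "manhattan", "l1", "euclidean", "l2", "cosine"]
results = {
    metric: pairwise.pairwise_distances(a, b, metric=metric)[0, 0]
    for metric in metrics
}

for metric, value in results.items():
    print(metric, value)
```

Here cityblock, manhattan, and l1 all give |1 - 4| + |0 - 4| = 7, while euclidean and l2 both give sqrt(9 + 16) = 5. The cosine distance is one minus the cosine similarity, so it depends only on the angle between the vectors.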

We can now solve problems. For example, if we were standing at the origin of a grid whose lines are streets, how far would we have to travel to get to the point (5, 5)?

pairwise.pairwise_distances([[0, 0], [5, 5]], metric='cityblock')[0]
array([ 0., 10.])

There's more...

Using pairwise distances, we can find the similarity between bit vectors. It's a matter of finding the Hamming distance, which, in the normalized form that SciPy and scikit-learn compute, is the fraction of positions at which two vectors differ:

$$d(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[x_i \neq y_i]$$

Use the following commands:

# np.bool was removed in NumPy 1.24; the builtin bool works instead.
X = np.random.binomial(1, .5, size=(2, 4)).astype(bool)
X
array([[False, True, False, False],
       [False, False, False, True]], dtype=bool)
# The two rows shown differ in 2 of 4 positions, so the distance is 0.5.
pairwise.pairwise_distances(X, metric='hamming')
array([[ 0. , 0.5],
       [ 0.5, 0. ]])
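The normalized Hamming distance can also be computed by hand as a cross-check. The sketch below hard-codes the two bit vectors instead of drawing them at random, which is an assumption made so the result is reproducible.

```python
import numpy as np
from sklearn.metrics import pairwise

# Fixed bit vectors (instead of random ones) so the result is reproducible.
X = np.array([[False, True, False, False],
              [False, False, False, True]])

# Fraction of positions where the two rows disagree.
manual = np.mean(X[0] != X[1])

library = pairwise.pairwise_distances(X, metric="hamming")[0, 1]

print(manual, library)
```

Multiplying either value by the vector length (4 here) gives the count of differing positions, if the unnormalized form is preferred.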
