Using k-NN for regression


Regression is covered elsewhere in the book, but we might also want to run a regression on "pockets" of the feature space. We can imagine that our dataset is produced by several distinct data-generating processes. If that's the case, training only on similar data points is a good idea.


Getting ready

Our old friend, regression, can be used in the context of clustering. Regression is obviously a supervised technique, so we'll use k-Nearest Neighbors (k-NN) clustering rather than KMeans.


For the k-NN regression, we'll use the k closest points in the feature space to build the regression rather than using the entire space as in regular regression.

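As a quick illustration of the idea before touching the iris data, here is a minimal sketch (the 1-D sine data below is made up purely for demonstration) showing that a k-NN regressor, by averaging nearby targets, can follow local structure that a single global line cannot:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D data: a noisy sine wave, which no single straight line fits well.
rng = np.random.RandomState(0)
X_toy = np.sort(rng.uniform(0, 6, size=100)).reshape(-1, 1)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=100)

lr_toy = LinearRegression().fit(X_toy, y_toy)
knn_toy = KNeighborsRegressor(n_neighbors=5).fit(X_toy, y_toy)

# The local averaging of k-NN tracks the curve far more closely than the line.
print(np.power(y_toy - lr_toy.predict(X_toy), 2).mean())
print(np.power(y_toy - knn_toy.predict(X_toy), 2).mean())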

How to do it…

For this recipe, we'll use the iris dataset. If we want to predict something such as the petal width for each flower, clustering by iris species can potentially give us better results. The k-NN regression won't cluster by the species, but we'll work under the assumption that the Xs will be close together for flowers of the same species, and so, in this case, will the petal length.


We'll use the iris dataset for this recipe:

from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
iris.feature_names
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

We'll try to predict the petal length based on the sepal length and width. We'll also fit a regular linear regression to see how well the k-NN regression does in comparison:


from sklearn.linear_model import LinearRegression
import numpy as np

lr = LinearRegression()
lr.fit(X, y)
print("The MSE is: {:.2}".format(np.power(y - lr.predict(X), 2).mean()))
The MSE is: 0.046
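For what it's worth, the same in-sample MSE can also be computed with scikit-learn's built-in metric instead of the manual NumPy expression (a small equivalent sketch):

from sklearn.metrics import mean_squared_error

# Equivalent to np.power(y - lr.predict(X), 2).mean()
print("The MSE is: {:.2}".format(mean_squared_error(y, lr.predict(X))))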

Now, for the k-NN regression, use the following code:

from sklearn.neighbors import KNeighborsRegressor

knnr = KNeighborsRegressor(n_neighbors=10)
knnr.fit(X, y)
print("The MSE is: {:.2}".format(np.power(y - knnr.predict(X), 2).mean()))
The MSE is: 0.022
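The choice of n_neighbors=10 here is fairly arbitrary. If you want to see how sensitive the fit is to that choice, a small sweep along the following lines (reusing the same X, y, and in-sample MSE as above) can help; note that smaller k values will naturally look better on the training data itself:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# In-sample MSE for a few neighborhood sizes.
for k in (3, 5, 10, 20, 50):
    model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    mse = np.power(y - model.predict(X), 2).mean()
    print("k={:2d}  MSE={:.3f}".format(k, mse))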

Let's look at what the k-NN regression does when we tell it to use the closest 10 points for regression:


import matplotlib.pyplot as plt

# Marker size is scaled by each model's predicted value.
f, ax = plt.subplots(nrows=2, figsize=(7, 10))
ax[0].set_title("Predictions")
ax[0].scatter(X[:, 0], X[:, 1], s=lr.predict(X)*80, label='LR Predictions', color='c', edgecolors='black')
ax[1].scatter(X[:, 0], X[:, 1], s=knnr.predict(X)*80, label='k-NN Predictions', color='m', edgecolors='black')
ax[0].legend()
ax[1].legend()

The following is the output (the top panel shows the linear regression predictions and the bottom panel the k-NN predictions, with marker size proportional to the predicted value):

It should be fairly clear that the predictions are close for the most part, but let's look at the predictions for the Setosa species as compared to the actuals:


# Build a Boolean mask that selects only the setosa samples.
setosa_idx = np.where(iris.target_names == 'setosa')
setosa_mask = iris.target == setosa_idx[0]
y[setosa_mask][:5]
array([0., 0., 0., 0., 0.])
knnr.predict(X)[setosa_mask][:5]
array([0., 0., 0., 0., 0.])
lr.predict(X)[setosa_mask][:5]
array([-0.08254936, -0.04012845, -0.04862768,  0.01229986, -0.07536672])

Looking at the plots again, the Setosa species (upper-left cluster) is largely overestimated by linear regression, and k-NN is fairly close to the actual values.
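To make the same comparison for all three species at once, a per-species breakdown of the in-sample error can be computed along these lines (a sketch reusing the lr and knnr models fitted above):

import numpy as np

# In-sample MSE broken down by species for both models.
for idx, name in enumerate(iris.target_names):
    mask = iris.target == idx
    lr_mse = np.power(y[mask] - lr.predict(X)[mask], 2).mean()
    knn_mse = np.power(y[mask] - knnr.predict(X)[mask], 2).mean()
    print("{:<12s} LR MSE={:.3f}  k-NN MSE={:.3f}".format(name, lr_mse, knn_mse))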


How it works...

The k-NN regression is calculated very simply: it takes the average of the target values of the k points closest to the point being tested.

Let's manually predict a single point:


example_point = X[0]

Now, we need to get the 10 closest points to our example_point:

# The first row of the pairwise distance matrix holds the distances
# from X[0] (our example_point) to every point in X.
from sklearn.metrics import pairwise
distances_to_example = pairwise.pairwise_distances(X)[0]

# Take the 10 nearest points (including the example point itself)
# and average their targets.
ten_closest_points = X[np.argsort(distances_to_example)][:10]
ten_closest_y = y[np.argsort(distances_to_example)][:10]
ten_closest_y.mean()
0.28000

We can see that this is very close to what was expected.
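As a sanity check (and assuming the default uniform neighbor weights used above), the fitted regressor's own prediction for the same point should agree with this manual average, up to possible ties among equally distant neighbors:

# Ask the fitted k-NN regressor for its prediction on the same point;
# scikit-learn expects a 2-D array, hence the reshape.
knnr.predict(example_point.reshape(1, -1))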
