Evaluating the linear regression model


In this recipe, we'll look at how well our regression fits the underlying data. We fit a regression in the last recipe, but didn't pay much attention to how well we actually did it. The first question after we fit the model was clearly "How well does the model fit?" In this recipe, we'll examine this question.


Getting ready

Let's use the lr object and the boston dataset; reach back into your code from the Fitting a line through data recipe. Now that the model has been fit, the lr object will have a lot of useful methods.

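If those objects aren't still in your session, a minimal setup along the following lines recreates them (a sketch, not part of the original recipe; it assumes scikit-learn's load_boston loader, which was removed in scikit-learn 1.2):

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

boston = load_boston()                  # features in boston.data, house prices in boston.target
lr = LinearRegression()
lr.fit(boston.data, boston.target)      # fit on the full dataset
predictions = lr.predict(boston.data)   # in-sample predictions used throughout this recipe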

How to do it...

There are some very simple metrics and plots we'll want to look at as well. Let's take another look at the residuals from the last recipe:


import matplotlib.pyplot as plt
import numpy as np

f = plt.figure(figsize=(7, 5))
ax = f.add_subplot(111)
# The residuals are the differences between the actual and predicted values
ax.hist(boston.target - predictions, bins=50)
ax.set_title("Histogram of Residuals.")

If you're using IPython Notebook, use the %matplotlib inline command to render the plots inline. If you're using a regular interpreter, simply type f.savefig('myfig.png') and the plot will be saved for you.


Plotting is done via matplotlib. This isn't the focus of this book, but it's useful to plot your results, so we'll show some basic plotting.


The following is the histogram showing the output:

As I mentioned previously, the error terms should be normal, with a mean of 0. The residuals are the errors; therefore, this plot should be approximately normal. Visually, it's a good fit, though a bit skewed. We can also look at the mean of the residuals, which should be very close to 0:


np.mean(boston.target - predictions)  # the mean of the residuals
3.033146856209123e-15

Clearly, we are very close to zero.

Another plot worth looking at is a Q-Q plot. We'll use SciPy here because it has a built-in probability plot:


from scipy.stats import probplot

f = plt.figure(figsize=(7, 5))
ax = f.add_subplot(111)
# Compare the residuals' quantiles against those of a normal distribution
probplot(boston.target - predictions, plot=ax)

The following screenshot shows the probability plot:

Here, the skewed values we saw earlier are a bit clearer.
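If you want to attach a number to that visual impression, SciPy's skew function computes the sample skewness of the residuals (this check is an addition to the original recipe); a value near 0 means a symmetric distribution:

from scipy.stats import skew

skew(boston.target - predictions)  # a positive value indicates a right-skewed distribution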

We can also look at some other metrics of the fit; mean squared error (MSE) and mean absolute deviation (MAD) are two common metrics. Let's define each one in Python and use them. Later in the book, we'll look at how scikit-learn has built-in metrics to evaluate the regression models:


def MSE(target, predictions):
    # Average of the squared differences between actual and predicted values
    squared_deviation = np.power(target - predictions, 2)
    return np.mean(squared_deviation)

MSE(boston.target, predictions)
21.894831181729202

def MAD(target, predictions):
    # Average of the absolute differences between actual and predicted values
    absolute_deviation = np.abs(target - predictions)
    return np.mean(absolute_deviation)

MAD(boston.target, predictions)
3.2708628109003115
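As a quick sanity check (not part of the original recipe), scikit-learn's built-in metrics, which we'll meet later in the book, should agree with our hand-rolled versions:

from sklearn.metrics import mean_absolute_error, mean_squared_error

mean_squared_error(boston.target, predictions)   # should match MSE above
mean_absolute_error(boston.target, predictions)  # should match MAD above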

How it works...

The formula for MSE is very simple:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

It takes each predicted value's deviation from the actual value, squares it, and then averages all the squared terms. This is actually the quantity we optimized to find the best set of coefficients for linear regression. The Gauss-Markov theorem guarantees that the least squares solution is best in the sense that, among all linear unbiased estimators, its coefficients have the smallest variance. In the Using ridge regression to overcome linear regression's shortfalls recipe, we'll look at what happens when we're okay with our coefficients being biased.

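As a quick illustration (an addition to the recipe, reusing the MSE function defined above): if we nudge the fitted coefficients away from the least squares solution, the in-sample MSE can only go up:

rng = np.random.RandomState(0)
perturbed = lr.coef_ + rng.normal(scale=0.01, size=lr.coef_.shape)
perturbed_predictions = boston.data.dot(perturbed) + lr.intercept_
MSE(boston.target, predictions) <= MSE(boston.target, perturbed_predictions)  # True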

MAD is the expected value of the absolute errors:

$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

MAD isn't used when fitting the linear regression, but it's worth taking a look at. Why? Think about what each one is doing and which errors are more important in each case. For example, with MSE, the larger errors get penalized more than the other terms because of the square term.

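A tiny made-up example (not from the original recipe) makes the difference concrete; a single large miss dominates MSE far more than MAD:

target = np.array([10.0, 10.0, 10.0, 10.0])
preds = np.array([10.0, 10.0, 10.0, 2.0])  # one miss of size 8
MSE(target, preds)  # 16.0: the squared miss (64) averaged over four points
MAD(target, preds)  # 2.0: the absolute miss (8) averaged over four points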

There's more...

One thing that's been glossed over a bit is the fact that the coefficients themselves are random variables, and therefore, they have a distribution. Let's use bootstrapping to look at the distribution of the coefficient for the crime rate. Bootstrapping is a very common technique to get an understanding of the uncertainty of an estimate:


n_bootstraps = 1000
len_boston = len(boston.target)
subsample_size = int(0.5 * len_boston)  # np.int is deprecated; the built-in int works everywhere

# Draw a random half-sized sample of row indices
subsample = lambda: np.random.choice(np.arange(0, len_boston), size=subsample_size)

coefs = np.ones(n_bootstraps)  # pre-allocate the space for the coefs
for i in range(n_bootstraps):
    subsample_idx = subsample()
    subsample_X = boston.data[subsample_idx]
    subsample_y = boston.target[subsample_idx]
    lr.fit(subsample_X, subsample_y)
    coefs[i] = lr.coef_[0]  # the coefficient for the crime rate, the first feature

Now, we can look at the distribution of the coefficient:

import matplotlib.pyplot as plt

f = plt.figure(figsize=(7, 5))
ax = f.add_subplot(111)
ax.hist(coefs, bins=50)  # distribution of the bootstrapped crime-rate coefficient
ax.set_title("Histogram of the lr.coef_[0].")

The following is the histogram that gets generated:

We might also want to look at the bootstrapped confidence interval:

np.percentile(coefs, [2.5, 97.5])  # both percentiles must be passed together in a list
array([-0.18566145,  0.03142513])

This is interesting; there's actually reason to believe that the crime rate might not have an impact on home prices. Notice how zero is within the CI, which means that it may not play a role. It's also worth pointing out that bootstrapping can lead to a potentially better estimation for coefficients, because, in the limit, the bootstrapped mean converges to the true mean faster than the coefficient found using regular estimation.

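One caveat worth a quick check (an addition to the recipe): the loop above left lr fit on the last subsample, so refit it on the full dataset before comparing the center of the bootstrap distribution to the full-data coefficient:

lr.fit(boston.data, boston.target)  # restore the full-data fit
coefs.mean(), lr.coef_[0]           # bootstrap center vs. full-data crime-rate coefficient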
