Once you start using ridge regression to make predictions or learn about relationships in the system you're modeling, you'll start thinking about the choice of alpha. For example, OLS regression might show a relationship between two variables; however, once the model is regularized by some alpha, that relationship may no longer be significant. This can matter when a decision rests on it.
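As a quick illustration (a sketch of ours, not part of the recipe; the alpha of 10 and the random_state are arbitrary choices), compare the coefficients OLS and ridge assign on nearly collinear data:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# effective_rank=1 makes the two features nearly collinear
X, y = make_regression(n_samples=100, n_features=2, effective_rank=1, noise=10, random_state=0)

print(LinearRegression().fit(X, y).coef_)  # OLS can put a large weight on either feature
print(Ridge(alpha=10).fit(X, y).coef_)     # ridge shrinks those weights toward zero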
Getting ready
This is the first recipe where we'll tune the parameters for a model. This is typically done by cross-validation. Later recipes will lay out a more general way to do this, but here we'll walk through just enough to tune ridge regression.
If you remember, in ridge regression the gamma parameter is typically represented as alpha in scikit-learn when calling Ridge; so, the question that arises is what the best alpha is. Create a regression dataset, and then let's get started:
import numpy as np
from sklearn.datasets import make_regression
reg_data, reg_target = make_regression(n_samples=100, n_features=2, effective_rank=1, noise=10)
How to do it...
In the linear_model module, there is an object called RidgeCV, which stands for ridge cross-validation. This performs a cross-validation similar to leave-one-out cross-validation (LOOCV). Under the hood, it's going to train the model on all samples except one. It'll then evaluate the error in predicting this one test case:
from sklearn.linear_model import RidgeCV
rcv = RidgeCV(alphas=np.array([.1, .2, .3, .4]))
rcv.fit(reg_data, reg_target)
RidgeCV(alphas=array([0.1, 0.2, 0.3, 0.4]), cv=None, fit_intercept=True,
gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)
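To see what that loop looks like, here is a rough sketch of the equivalent computation for a single alpha, written by hand (assuming a modern sklearn where LeaveOneOut lives in sklearn.model_selection; RidgeCV itself uses an efficient closed-form shortcut rather than an explicit loop, so treat this as an illustration, not the library's actual code):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

errors = []
for train_idx, test_idx in LeaveOneOut().split(reg_data):
    # fit on everything except the single held-out sample
    model = Ridge(alpha=0.1).fit(reg_data[train_idx], reg_target[train_idx])
    # squared error on the held-out sample
    errors.append((reg_target[test_idx][0] - model.predict(reg_data[test_idx])[0]) ** 2)

np.mean(errors)  # mean leave-one-out error for alpha=0.1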
After we fit the regression, the alpha_ attribute will be the best choice of alpha:
rcv.alpha_
0.10000000000000001
In the previous example, it was the first choice. We might want to home in on something around .1:
rcv2 = RidgeCV(alphas=np.array([.08, .09, .1, .11, .12]))
rcv2.fit(reg_data, reg_target)
RidgeCV(alphas=array([ 0.08, 0.09, 0.1 , 0.11, 0.12]), cv=None, fit_intercept=True,
gcv_mode=None, loss_func=None, normalize=False, score_func=None, scoring=None, store_cv_values=False)
rcv2.alpha_
0.08
We can continue this hunt, but hopefully, the mechanics are clear.
How it works...
The mechanics might be clear, but we should talk a little more about the why and define what was meant by "best". At each step in the cross-validation process, the model scores an error against the test sample. By default, it's essentially a squared error. Check out the There's more... section for more details.
We can force the RidgeCV object to store the cross-validation values; this will let us visualize what it's doing:
alphas_to_test = np.linspace(0.01, 1)  # 50 evenly spaced points by default
rcv3 = RidgeCV(alphas=alphas_to_test, store_cv_values=True)
rcv3.fit(reg_data, reg_target)
As you can see, we test a bunch of points (50 in total) between 0.01 and 1. Since we passed store_cv_values as True, we can access these values:
rcv3.cv_values_.shape
(100, 50)
So, we had 100 values in the initial regression and tested 50 different alpha values, which means we have 100 errors for each of the 50 alphas. We can now find the smallest mean error and choose the corresponding alpha:
smallest_idx = rcv3.cv_values_.mean(axis=0).argmin()
alphas_to_test[smallest_idx]
0.030204081632653063
The question that arises is "Does RidgeCV agree with our choice?" Use the following command to find out:
rcv3.alpha_
0.030204081632653063
Beautiful!
It's also worthwhile to visualize what's going on. In order to do that, we'll plot the mean error for all 50 test alphas, as sketched below.
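A minimal plotting sketch (ours, assuming matplotlib is available; the original text only describes the plot):

import matplotlib.pyplot as plt

mean_errors = rcv3.cv_values_.mean(axis=0)  # one mean error per tested alpha
plt.plot(alphas_to_test, mean_errors)
plt.scatter(alphas_to_test[smallest_idx], mean_errors[smallest_idx])  # mark the minimum
plt.xlabel("alpha")
plt.ylabel("mean CV error")
plt.show()

The marked point corresponds to the alpha that RidgeCV chose above.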
There's more...
If we want to use our own scoring function, we can do that as well. Since we looked at the mean absolute deviation (MAD) before, let's use it to score the differences. First, we need to define our loss function:
def MAD(target, predictions):
    absolute_deviation = np.abs(target - predictions)
    return absolute_deviation.mean()
After we define the loss function, we can employ the make_scorer function in sklearn. This will take care of standardizing our function so that scikit-learn's objects know how to use it. Also, because this is a loss function and not a score function, the lower the better, and thus we need to let sklearn flip the sign to turn this from a maximization problem into a minimization problem:
import sklearn
# make_scorer negates our loss (greater_is_better=False), so higher scores mean lower MAD
MAD = sklearn.metrics.make_scorer(MAD, greater_is_better=False)
rcv4 = RidgeCV(alphas=alphas_to_test, store_cv_values=True, scoring=MAD)
rcv4.fit(reg_data, reg_target)
smallest_idx = rcv4.cv_values_.mean(axis=0).argmin()
alphas_to_test[smallest_idx]
1.0