We're going to work with some ideas similar to those we saw in the recipe on Lasso Regression. In that recipe, we looked at the number of features that had zero coefficients. Now we're going to take this a step further and use the sparsity associated with L1 norms to preprocess the features.
Getting ready
We'll use the diabetes dataset to fit a regression. First, we'll fit a basic LinearRegression model with a ShuffleSplit cross-validation. After we do that, we'll use Lasso regression (through LassoCV) to find the coefficients that are 0 when using an L1 penalty. This will hopefully help us avoid overfitting, which means that the model is too specific to the data it was trained on. To put this another way, an overfit model does not generalize well to outside data.
We're going to perform the following steps:
1. Load the dataset.
2. Fit a basic linear regression model.
3. Use feature selection to remove uninformative features.
4. Refit the linear regression and check how well it fits compared with the fully featured model.
How to do it...
First, let's get the dataset:
import sklearn.datasets as ds
diabetes = ds.load_diabetes()
Let's create the LinearRegression object:
from sklearn import linear_model
lr = linear_model.LinearRegression()
Let's also import the metrics module for the mean_squared_error function and the model_selection module for the ShuffleSplit cross-validation scheme:
from sklearn import metrics
from sklearn import model_selection
# the first argument is n_splits, so this draws one random split per sample
shuff = model_selection.ShuffleSplit(n_splits=diabetes.target.size)
Now, let's fit the model and keep track of the mean squared error for each iteration of ShuffleSplit:
import numpy as np

mses = []
for train, test in shuff.split(diabetes.data, diabetes.target):
    # train and test are integer index arrays, so index with them directly
    train_X = diabetes.data[train]
    train_y = diabetes.target[train]
    test_X = diabetes.data[test]
    test_y = diabetes.target[test]
    lr.fit(train_X, train_y)
    mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
# the splits are random, so the exact value varies from run to run
np.mean(mses)
2866.381286447516
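The manual loop above can also be written with cross_val_score, which handles the fitting and scoring for each split. This is a minimal sketch rather than part of the original recipe: n_splits=10, test_size=0.1, and random_state=0 are illustrative assumptions chosen to keep the run short and repeatable.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection

diabetes = datasets.load_diabetes()
lr = linear_model.LinearRegression()

# Assumed settings: 10 random splits, 10% held out, fixed seed for repeatability
shuff = model_selection.ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

# scikit-learn scorers follow "higher is better", so MSE comes back negated
scores = model_selection.cross_val_score(
    lr, diabetes.data, diabetes.target,
    scoring="neg_mean_squared_error", cv=shuff,
)
mse_per_split = -scores
mean_mse = mse_per_split.mean()
print(mean_mse)
```

Because the scorer returns negated errors, the sign is flipped before averaging.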
Now that we have the regular fit, let's check it again after we eliminate any features with a zero coefficient. Let's fit the Lasso regression:
cv = linear_model.LassoCV()
cv.fit(diabetes.data, diabetes.target)
cv.coef_
array([ -0. , -226.2375274 , 526.85738059, 314.44026013,
-196.92164002, 1.48742026, -151.78054083, 106.52846989,
530.58541123, 64.50588257])
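To see which named features the L1 penalty zeroed out, the coefficients can be paired with the dataset's feature_names attribute. This is a small sketch, assuming cv=5 for LassoCV (an illustrative choice); which features end up at zero may differ between scikit-learn versions.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()

# cv=5 is an assumed, illustrative number of folds for choosing alpha
cv = linear_model.LassoCV(cv=5)
cv.fit(diabetes.data, diabetes.target)

names = np.array(diabetes.feature_names)
dropped = names[cv.coef_ == 0]   # features the L1 penalty removed
kept = names[cv.coef_ != 0]      # features that survive

for name, coef in zip(names, cv.coef_):
    print(f"{name:>4s} {coef: .3f}")
print("dropped:", list(dropped))
```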
We'll remove the first feature. I'll use a NumPy array to represent the columns that are to be included in the model:
import numpy as np
# indices of the features whose Lasso coefficient is nonzero
columns = np.arange(diabetes.data.shape[1])[cv.coef_ != 0]
columns
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
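scikit-learn also provides SelectFromModel in the feature_selection module, which wraps the same idea: fit an estimator and keep the features whose coefficients are large enough. The sketch below is an alternative to the boolean-mask approach, not the recipe's own method, and cv=5 is an assumed setting.

```python
import numpy as np
from sklearn import datasets, feature_selection, linear_model

diabetes = datasets.load_diabetes()

# Fit LassoCV inside the selector; cv=5 is an illustrative assumption
selector = feature_selection.SelectFromModel(linear_model.LassoCV(cv=5))
selector.fit(diabetes.data, diabetes.target)

# Integer indices of the retained features, comparable to `columns` above
columns = selector.get_support(indices=True)

# Reduced design matrix containing only the retained columns
X_reduced = selector.transform(diabetes.data)
print(columns, X_reduced.shape)
```

The fitted Lasso model is available afterwards as selector.estimator_.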
Okay, so now we'll fit the model with the selected features (see the columns in the following code block):
l1mses = []
for train, test in shuff.split(diabetes.data, diabetes.target):
    # keep only the columns with a nonzero Lasso coefficient
    train_X = diabetes.data[train][:, columns]
    train_y = diabetes.target[train]
    test_X = diabetes.data[test][:, columns]
    test_y = diabetes.target[test]
    lr.fit(train_X, train_y)
    l1mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
# the splits are random, so the exact value varies from run to run
np.mean(l1mses)
2871.7973251720105
np.mean(l1mses) - np.mean(mses)
5.416038724494683
As we can see, even though we removed an uninformative feature, the model fits slightly worse. This isn't always the case. In the next section, we'll compare the fits of models when there are many uninformative features.
How it works...
First, we're going to create a regression dataset with many uninformative features:
X, y = ds.make_regression(noise=5)
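With its default arguments, make_regression produces 100 samples and 100 features, of which only 10 are informative. Passing coef=True also returns the true coefficients, which makes it easy to see which features actually drive the target. This is a sketch for inspection only; random_state=0 is an assumption added for repeatability.

```python
import numpy as np
from sklearn import datasets

# coef=True additionally returns the ground-truth coefficients;
# random_state=0 is an illustrative assumption, not part of the recipe
X, y, true_coef = datasets.make_regression(noise=5, coef=True, random_state=0)

# Indices of the features that really contribute to y
informative = np.flatnonzero(true_coef)
print(X.shape)
print(informative)
```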
Let's fit a normal regression:
mses = []
shuff = model_selection.ShuffleSplit(n_splits=y.size)
for train, test in shuff.split(X, y):
    train_X = X[train]
    train_y = y[train]
    # evaluate on the held-out rows
    test_X = X[test]
    test_y = y[test]
    lr.fit(train_X, train_y)
    mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
# the splits are random, so the exact value varies from run to run
np.mean(mses)
195.60396533629236
Now, we can walk through the same process for Lasso regression:
cv.fit(X, y)
LassoCV(alphas=None, copy_X=True, cv='warn', eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=100, n_jobs=None, normalize=False,
positive=False, precompute='auto', random_state=None,
selection='cyclic', tol=0.0001, verbose=False)
We'll create the columns again. This is a nice pattern that will allow us to specify the features we want to include:
import numpy as np
# indices of the features whose Lasso coefficient is nonzero
columns = np.arange(X.shape[1])[cv.coef_ != 0]
columns[:5]
array([ 0, 9, 21, 24, 26])
mses = []
shuff = model_selection.ShuffleSplit(n_splits=y.size)
for train, test in shuff.split(X, y):
    train_X = X[train][:, columns]
    train_y = y[train]
    # evaluate on the held-out rows, restricted to the selected columns
    test_X = X[test][:, columns]
    test_y = y[test]
    lr.fit(train_X, train_y)
    mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
# the splits are random, so the exact value varies from run to run
np.mean(mses)
9.59713146586824
As we can see, we get an extreme improvement in the fit of the model. This just exemplifies that we need to be cognizant that not all the features need to be or should be thrown into the model.
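One caveat with the walkthrough above is that the Lasso step sees the full dataset before the cross-validation loop runs. A possible variation, sketched here under stated assumptions, places the selection inside a pipeline so that it is refit on the training rows of every split. The use of make_pipeline with SelectFromModel, along with n_splits=10, test_size=0.1, cv=5, and random_state=0, are illustrative choices and not part of the original recipe.

```python
import numpy as np
from sklearn import (datasets, feature_selection, linear_model,
                     model_selection, pipeline)

# Same kind of data as above; random_state=0 is assumed for repeatability
X, y = datasets.make_regression(noise=5, random_state=0)
shuff = model_selection.ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

# Baseline: linear regression on every feature
full = linear_model.LinearRegression()

# Selection + regression; the Lasso is refit on each training split only
selected = pipeline.make_pipeline(
    feature_selection.SelectFromModel(linear_model.LassoCV(cv=5)),
    linear_model.LinearRegression(),
)

def mean_mse(model):
    """Average held-out mean squared error over the ShuffleSplit splits."""
    scores = model_selection.cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=shuff
    )
    return -scores.mean()

full_mse = mean_mse(full)
selected_mse = mean_mse(selected)
print(full_mse, selected_mse)
```

Comparing the two printed values gives the same kind of comparison as above, with the selection step evaluated only on data it was not fit on.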