We're going to work with some ideas similar to those we saw in the recipe on Lasso Regression. In that recipe, we looked at the number of features that had zero coefficients. Now we're going to take this a step further and use the sparsity associated with L1 norms to preprocess the features.
Getting ready
We'll use the diabetes dataset to fit a regression. First, we'll fit a basic LinearRegression model with ShuffleSplit cross-validation. After we do that, we'll use LassoCV to find the coefficients that are 0 when using an L1 penalty. This will hopefully help us avoid overfitting, which means that the model is too specific to the data it was trained on. To put this another way, an overfit model does not generalize well to outside data.
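To make the link between the L1 penalty and zero coefficients concrete, here is a minimal sketch that fits a plain Lasso on the diabetes data at a few penalty strengths and counts the coefficients that end up exactly zero. The alpha values and the max_iter setting are illustrative assumptions, not values taken from the recipe.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()

# Illustrative penalty strengths (assumed values, not from the recipe)
alphas = [1e-4, 0.1, 1.0, 10.0]

zero_counts = []
for alpha in alphas:
    lasso = linear_model.Lasso(alpha=alpha, max_iter=10_000)
    lasso.fit(diabetes.data, diabetes.target)
    # Count the coefficients that the L1 penalty drove to exactly zero
    zero_counts.append(int(np.sum(lasso.coef_ == 0)))

for alpha, n_zero in zip(alphas, zero_counts):
    print(f"alpha={alpha}: {n_zero} of {lasso.coef_.size} coefficients are zero")
```

A stronger penalty tends to zero out more coefficients, which is the property the rest of the recipe uses for feature selection.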
We're going to perform the following steps:
1. Load the dataset.
2. Fit a basic linear regression model.
3. Use feature selection to remove uninformative features.
4. Refit the linear regression and check to see how well it fits compared with the fully featured model.
How to do it...
First, let's get the dataset:
import sklearn.datasets as ds
diabetes = ds.load_diabetes()
Let's create the LinearRegression object:
from sklearn import linear_model
lr = linear_model.LinearRegression()
Let's also import the metrics module for the mean_squared_error function and the model_selection module for the ShuffleSplit cross-validation scheme:
from sklearn import metrics
from sklearn import model_selection
# In model_selection, the first argument of ShuffleSplit is the number of splits, not the sample count
shuff = model_selection.ShuffleSplit(n_splits=10)
Now, let's fit the model, and we'll keep track of the mean squared error for each iteration of ShuffleSplit:
mses = []
for train, test in shuff.split(diabetes.data, diabetes.target):
    train_X = diabetes.data[train]
    train_y = diabetes.target[train]
    # test holds the held-out row indices; ~train would only bitwise-negate the training indices
    test_X = diabetes.data[test]
    test_y = diabetes.target[test]
    lr.fit(train_X, train_y)
    mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
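The same per-split errors can also be collected without a hand-written loop. The following is a minimal sketch, assuming cross_val_score with the "neg_mean_squared_error" scorer is an acceptable stand-in for the loop; the fixed random_state is an assumption added only for repeatability.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection

diabetes = datasets.load_diabetes()
lr = linear_model.LinearRegression()

# random_state is fixed only so the sketch is repeatable (an assumption)
shuff = model_selection.ShuffleSplit(n_splits=10, random_state=0)

# scikit-learn reports negated MSE so that larger scores are always better
neg_mses = model_selection.cross_val_score(
    lr, diabetes.data, diabetes.target,
    scoring="neg_mean_squared_error", cv=shuff,
)
mses = -neg_mses
print("mean MSE over splits:", np.mean(mses))
```

Flipping the sign recovers ordinary mean squared errors, one per split.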
So now that we have the regular fit, let's check it after we eliminate any features with a zero for the coefficient. Let's fit the Lasso Regression:
from sklearn import feature_selection
cv = linear_model.LassoCV()
cv.fit(diabetes.data, diabetes.target)
cv.coef_
array([  -0.        , -226.2375274 ,  526.85738059,  314.44026013,
       -196.92164002,    1.48742026, -151.78054083,  106.52846989,
        530.58541123,   64.50588257])
We'll remove the first feature, since its coefficient is zero. I'll use a NumPy array to represent the columns that are to be included in the model:
import numpy as np
columns = np.arange(diabetes.data.shape[1])[cv.coef_ != 0]
columns
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
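scikit-learn also ships a transformer for this kind of coefficient-based selection. The sketch below uses SelectFromModel as an alternative to the manual boolean mask; it is not what the recipe itself does, and the threshold of 1e-5 is an assumed small cutoff chosen so that only coefficients that are effectively zero get dropped.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.feature_selection import SelectFromModel

diabetes = datasets.load_diabetes()

# Fit LassoCV inside the selector; features whose absolute coefficient falls
# below the (assumed) threshold are discarded
selector = SelectFromModel(linear_model.LassoCV(), threshold=1e-5)
selector.fit(diabetes.data, diabetes.target)

# Indices of the retained columns, comparable to the columns array in the recipe
columns = np.flatnonzero(selector.get_support())
X_selected = selector.transform(diabetes.data)
print(columns, X_selected.shape)
```

The fitted LassoCV is available as selector.estimator_, so its coef_ can still be inspected directly.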
Okay, so now we'll fit the model with the specific features (see the columns in the following code block):
l1mses = []
for train, test in shuff.split(diabetes.data, diabetes.target):
    train_X = diabetes.data[train][:, columns]
    train_y = diabetes.target[train]
    # Index with test, the held-out rows, rather than ~train
    test_X = diabetes.data[test][:, columns]
    test_y = diabetes.target[test]
    lr.fit(train_X, train_y)
    l1mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
np.mean(l1mses) - np.mean(mses)
As we can see, even though we removed an uninformative feature, the model still fits worse. This isn't always the case. In the next section, we'll compare fits between models where there are many uninformative features.
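Because ShuffleSplit draws fresh random splits each time it is iterated unless random_state is set, the difference between the two means includes split-to-split noise. The following is a minimal sketch of a paired comparison in which both models are scored on identical splits. The helper split_mses and the choice of random_state=0 are hypothetical additions, not part of the recipe.

```python
import numpy as np
from sklearn import datasets, linear_model, metrics, model_selection

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

cv = linear_model.LassoCV().fit(X, y)
columns = np.flatnonzero(cv.coef_ != 0)

# Materialize the splits once so both models see identical train/test rows;
# random_state=0 is an arbitrary choice made for repeatability
shuff = model_selection.ShuffleSplit(n_splits=10, random_state=0)
splits = list(shuff.split(X, y))

def split_mses(features):
    """Return the test MSE of a fresh LinearRegression on each stored split."""
    out = []
    for train, test in splits:
        lr = linear_model.LinearRegression()
        lr.fit(X[train][:, features], y[train])
        pred = lr.predict(X[test][:, features])
        out.append(metrics.mean_squared_error(y[test], pred))
    return np.array(out)

all_columns = np.arange(X.shape[1])
diffs = split_mses(columns) - split_mses(all_columns)
print("mean difference (selected - full):", diffs.mean())
print("splits where the selected model is better:", int((diffs < 0).sum()))
```

Looking at the per-split differences, rather than only the difference of means, gives a rough sense of whether the change is consistent or just noise.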
How it works...
First, we're going to create a regression dataset with many uninformative features:
X, y = ds.make_regression(noise=5)
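To see why this dataset has many uninformative features, it helps to look at the defaults of make_regression. This short sketch asks for the true coefficients with coef=True; the random_state argument is an assumption added so the output is repeatable.

```python
import numpy as np
import sklearn.datasets as ds

# coef=True also returns the coefficients used to generate y;
# random_state is an assumption added so the sketch is repeatable
X, y, true_coef = ds.make_regression(noise=5, coef=True, random_state=0)

# Features with a nonzero true coefficient are the informative ones
informative = np.flatnonzero(true_coef)
print(X.shape)            # defaults: 100 samples, 100 features
print(informative.size)   # default n_informative is 10
```

With the defaults, only 10 of the 100 features actually drive the target, so the remaining 90 are pure noise from the model's point of view.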
Let's fit a normal regression:
mses = []
shuff = model_selection.ShuffleSplit(n_splits=10)
for train, test in shuff.split(X, y):
    train_X = X[train]
    train_y = y[train]
    # Use the test indices for the held-out rows
    test_X = X[test]
    test_y = y[test]
    lr.fit(train_X, train_y)
    mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
Now, we can walk through the same process for Lasso regression:
cv.fit(X, y)
LassoCV(alphas=None, copy_X=True, cv='warn', eps=0.001, fit_intercept=True,
        max_iter=1000, n_alphas=100, n_jobs=None, normalize=False,
        positive=False, precompute='auto', random_state=None,
        selection='cyclic', tol=0.0001, verbose=False)
We'll create the columns again. This is a nice pattern that will allow us to specify the features we want to include:
import numpy as np
columns = np.arange(X.shape[1])[cv.coef_ != 0]
columns
array([ 0,  9, 21, 24, 26])
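Then we refit the linear regression using only those columns:
# Use a separate list so the full-model errors in mses are not overwritten
l1mses = []
shuff = model_selection.ShuffleSplit(n_splits=10)
for train, test in shuff.split(X, y):
    train_X = X[train][:, columns]
    train_y = y[train]
    test_X = X[test][:, columns]
    test_y = y[test]
    lr.fit(train_X, train_y)
    l1mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))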
As we can see, we get an extreme improvement in the fit of the model. This just exemplifies that we need to be cognizant that not all features need to be or should be thrown into the model.
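One caveat with the walkthrough above is that the Lasso selection is fitted on all rows, including rows that are later used for testing. The following is a minimal sketch of a variant that redoes the selection inside each training split by wrapping it in a pipeline. The use of SelectFromModel, the 1e-5 threshold, the helper mean_mse, and the fixed random_state values are all assumptions for illustration, not part of the recipe.

```python
import sklearn.datasets as ds
from sklearn import linear_model, model_selection
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import make_pipeline

X, y = ds.make_regression(noise=5, random_state=0)

# An integer random_state makes every call to split() yield the same splits,
# so both models are scored on identical train/test rows
shuff = model_selection.ShuffleSplit(n_splits=10, random_state=0)

full = linear_model.LinearRegression()
# The selector is refit on each training split, so test rows never
# influence which features are kept
selected = make_pipeline(
    SelectFromModel(linear_model.LassoCV(), threshold=1e-5),
    linear_model.LinearRegression(),
)

def mean_mse(model):
    """Average test MSE of the model over the shared ShuffleSplit splits."""
    scores = model_selection.cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=shuff
    )
    return -scores.mean()

full_mse = mean_mse(full)
selected_mse = mean_mse(selected)
print("full model MSE:    ", full_mse)
print("selected model MSE:", selected_mse)
```

With 100 features and only about 90 training rows per split, the full linear model has more parameters than observations, which is one reason dropping the uninformative features helps so much here.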