
In Section 6.6.2 of An Introduction to Statistical Learning, the authors do the following:

A) Fit a lasso model

lasso.mod = glmnet(x[train, ], y[train], alpha = 1, lambda = grid)

B) Perform cross-validation

set.seed(1)
cv.out = cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)
bestlam = cv.out$lambda.min

C) Compute the test error using the best value of $\lambda$ obtained in part B)

lasso.pred = predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)

But it seems there is an error here? They are using the $\lambda$ from part B) with the model from part A). Surely they should be using the $\lambda$ from part B) with the model from part B), rather than mixing the results of A) and B)?

They do the same thing in the previous section (6.6.1) for ridge regression, so if there is a typo/mistake in the lasso section, there is also one in the ridge regression section.

So is it a typo/mistake, or am I mistaken?

ManUtdBloke
  • The authors use the same training dataset and the same enet parameter in (A) and (B), and the coefficients are estimated for each value of the penalty in (A); step (B) only helps in finding the optimal penalty value, so I don't think there's an error in this case. – chl Oct 06 '20 at 16:15
  • Yes, but as the penalty value $\lambda$ varies in part (B), different coefficients are generated for the model. A specific set of coefficients is associated with the optimal penalty. But then the authors take the optimal penalty value and use it with coefficients that come from part A). So it still seems like there is a mistake here. – ManUtdBloke Oct 06 '20 at 16:24
  • What is happening 'under the hood' when $\lambda$ from part B) is combined with the results of part A)? – ManUtdBloke Oct 06 '20 at 16:42

1 Answer


Step A doesn't provide a single model; it provides a set of models, one for each value of $\lambda$, developed on all of `x[train, ]` and `y[train]`. There is no single model in Step B either, even for a single value of $\lambda$: with the default 10-fold cross-validation in `cv.glmnet`, you develop 10 different models for each value of $\lambda$. Here's what's going on "under the hood."
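
To see this concretely, here is a minimal sketch (assuming the `x`, `train`, and `grid` objects from the lab, and the `lasso.mod` fit from Step A) showing that the Step A object holds an entire solution path rather than a single model:

library(glmnet)
# The Step A fit stores one coefficient vector per penalty value in the grid.
dim(coef(lasso.mod))        # (p + 1) rows, one column per lambda value
length(lasso.mod$lambda)    # number of penalty values along the path
coef(lasso.mod)[1:5, 1:3]   # a few coefficients at the first few penalty values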

Cross validation tries to estimate what would happen if you repeated a modeling process on different samples from a population. For each fold of 10-fold CV, you build a model on 90% of the cases as an internal "training" set, then evaluate performance on the remaining 10% as an internal "test" set. After all 10 folds, each case has been included in one internal "test" set and 9 internal "training" sets. The performance (here, mean-square error on the 10 internal "test" sets) is averaged over all those 10 models developed at that $\lambda$ value.
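
As a rough sketch of that per-fold bookkeeping (assuming the lab's `x`, `y`, `train`, and `grid` objects, and a single illustrative penalty value; `cv.glmnet` does this for the whole $\lambda$ sequence at once):

set.seed(1)
folds = sample(rep(1:10, length.out = length(train)))  # assign each training case to a fold
cv.errors = sapply(1:10, function(k) {
  heldout = train[folds == k]     # internal "test" set for this fold
  kept    = train[folds != k]     # internal "training" set for this fold
  fit  = glmnet(x[kept, ], y[kept], alpha = 1, lambda = grid)
  pred = predict(fit, s = grid[50], newx = x[heldout, ])
  mean((pred - y[heldout])^2)     # this fold's mean-square error at that lambda
})
mean(cv.errors)                   # the CV estimate of test error for that single lambda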

Those 10 lasso models at a single $\lambda$ value may differ not only in their regression coefficients but even in which predictors are selected for inclusion. That's OK. The point of evaluating over the range of $\lambda$ penalty values, as @chl notes in a comment, is to find an optimal penalty value that best balances bias against variance, minimizing the expected mean-square error when you apply your modeling process to the underlying population. That's a key concept to recognize: in many circumstances it's the modeling process that you're evaluating, not the model itself.
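
The output of `cv.glmnet` makes that trade-off visible (a sketch assuming the `cv.out` object from Step B):

cbind(lambda = cv.out$lambda, cv.mse = cv.out$cvm)[1:5, ]  # averaged CV error per penalty
cv.out$lambda.min    # the penalty with the smallest cross-validated MSE
min(cv.out$cvm)      # that smallest estimated expected MSE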

Then you go back to your original set of models developed in Step A, over a grid of $\lambda$ values, and select the model developed at that optimal value of $\lambda$. Yes, the details of that model will differ from all of the 10 models developed at that value of $\lambda$ during cross validation. But that choice of $\lambda$ means that the resulting model still should have the best expected performance when applied to new cases from the population.
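
One way to see which Step A model the chosen penalty picks out (a sketch, assuming the `lasso.mod` and `bestlam` objects defined above):

lasso.coef = coef(lasso.mod, s = bestlam)  # look up bestlam on the Step A path
lasso.coef                                 # zeros print as dots in the sparse output
sum(lasso.coef != 0)                       # number of nonzero coefficients retained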

EdM
  • The models in Step A) are associated with a grid of $\lambda$ values, as you say. Let's suppose there are $n$ of them in this grid: $\lambda_1,\lambda_2,\dots,\lambda_n$. Then in Step B) you say we find an optimal $\lambda_\text{opt}$, go back to the original set of models developed in Step A), and select the model at $\lambda_\text{opt}$. But how do we know that $\lambda_\text{opt}$ is featured in the original set of models? Is the $\lambda_\text{opt}$ from cross-validation guaranteed to be in the grid $\lambda_1,\lambda_2,\dots,\lambda_n$? – ManUtdBloke Oct 07 '20 at 14:51
  • If $\lambda_\text{opt}$ isn't in the grid, then how is it being used to select a model from Step A), when the models from Step A) are each associated with a particular $\lambda_i$ in the grid? – ManUtdBloke Oct 07 '20 at 14:56
  • @ManUtdBloke the manual page for `glmnet` describes the interpolation versus refitting methods used if the particular $\lambda_{opt}$ isn't in the original grid, in its description of the `exact` argument's TRUE versus FALSE settings. In practice, I tend to start with `cv.glmnet`, find $\lambda_{opt}$, and run `glmnet` with just that one value of $\lambda$. – EdM Oct 07 '20 at 15:34
  • (+1) The [±1 SE rule](https://stats.stackexchange.com/q/138569/930) is also a good option to consider in many settings. – chl Oct 07 '20 at 19:21
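
A minimal sketch of the two routes mentioned in the comments above (assuming the lab's objects; note that newer versions of `glmnet` require the original `x` and `y` when `exact = TRUE`):

# Route 1: evaluate bestlam on the existing Step A path, refitting exactly
# if it was not part of the original grid.
coef(lasso.mod, s = bestlam, exact = TRUE, x = x[train, ], y = y[train])
# Route 2: refit glmnet at just that one penalty value.
single.fit = glmnet(x[train, ], y[train], alpha = 1, lambda = bestlam)
coef(single.fit)
# The 1 SE alternative: the largest (most penalized) lambda whose CV error is
# within one standard error of the minimum.
cv.out$lambda.1se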