
I applied lasso regression for variable selection, and out of 10 variables the lasso selected 4.

library(glmnet)

fit.lasso <- glmnet(x, y, alpha = 1)   # alpha = 1 is the lasso penalty
plot(fit.lasso, xvar = "lambda", label = TRUE)

cv.lasso <- cv.glmnet(x, y)            # cross-validation to choose lambda
coef(cv.lasso)                         # coefficients at lambda.1se by default

Suppose that after this command we get 4 variables with non-zero coefficients, say x1, x2, x3, and x4.

Then, I used these commands:

fit <- lm(y ~ x1 + x2 + x3 + x4)
summary(fit)   # p-values for x1..x4

If x3 has a p-value > 0.05, do I need to remove it from the final model?

Or should I instead use the coefficient values from coef(cv.lasso) to predict y?

Any hints?

I am using R.

Kiran Prajapati

1 Answer


After you have done LASSO, you should generally NOT use the selected variables in a separate standard linear regression.

There are several ways to select a subset of predictor variables for a model. For example, you could use stepwise regression or, with few enough predictors, you could examine all possible subsets of predictors. In these cases a criterion like AIC is used to trade off the fit against the number of variables included.
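For illustration, a minimal sketch of AIC-based stepwise selection in base R, assuming your 10 predictors are the columns of the matrix x from your question:

dat <- data.frame(y = y, x)                            # response plus the 10 predictors
full <- lm(y ~ ., data = dat)                          # model with all predictors
step.fit <- step(full, direction = "both", trace = 0)  # stepwise search scored by AIC
summary(step.fit)                                      # the selected subset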

But in common use the selected variables are then simply incorporated into a standard linear regression. The p-values in that linear regression are not valid, as they do not incorporate the fact that you had already performed outcome-based variable selection. Also, if there are correlations among predictors, the particular variables that you choose can depend heavily upon the particular data sample you analyzed. Making this worse, the regression coefficients for those selected from a set of correlated predictors will tend to be larger in magnitude than their true values in the population. Thus the results from these types of linear models can have poor performance on new samples from the population. You can examine these behaviors by analyzing multiple bootstrap samples from your data set.
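Here is a minimal sketch of that bootstrap check, assuming x and y are as in your question; it refits the LASSO on resampled rows and tallies how often each variable is selected:

set.seed(1)
picks <- replicate(20, {
  i  <- sample(nrow(x), replace = TRUE)          # resample rows with replacement
  cf <- as.matrix(coef(cv.glmnet(x[i, ], y[i]))) # refit LASSO on the resample
  rownames(cf)[-1][cf[-1, 1] != 0]               # names of the selected predictors
}, simplify = FALSE)
table(unlist(picks))                             # selection frequency per variable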

Although LASSO also may select different sets of variables on different data samples, it has a major advantage over those other approaches. It also penalizes the regression coefficients of the selected variables, lowering their magnitudes from those in a standard linear regression. This penalization typically improves the ability to predict results on new data samples.
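You can see this shrinkage directly by comparing the penalized coefficients against an unpenalized OLS refit on the same variables (a sketch, with cv.lasso, x and y as in your question):

cf   <- as.matrix(coef(cv.lasso))                # penalized coefficients at lambda.1se
keep <- which(cf[-1, 1] != 0)                    # indices of the selected predictors
ols  <- coef(lm(y ~ x[, keep]))                  # unpenalized refit on those columns
cbind(lasso = cf[c(1, keep + 1), 1], ols = ols)  # LASSO magnitudes are typically smaller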

So if you simply take the LASSO-selected variables and put them into a new linear regression, not only do you have the problems imposed by all variable selection approaches but also you have lost the LASSO advantage of penalizing coefficients of the selected variables to improve prediction. The p-values for that linear regression will be no more valid than for stepwise or best-subset selection. (Removing "insignificant" predictors in a multiple regression based on a p-value cutoff is not a good idea in any event).

So your proposed approach would undo the good that you did by choosing a principled approach like LASSO, with its penalization, in the first place. As Richard Hardy notes in a comment, there can be ways to use LASSO for variable selection to incorporate into linear regressions, but those are specialized multi-step approaches for particular circumstances, and they don't seem to give any advantage over glmnet() in your application.

So stick with the predictors and coefficients that LASSO provided.
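For prediction on new data, the coef() and predict() methods for the cv.glmnet fit are all you need. A sketch, where newx is a hypothetical matrix of new observations with the same 10 columns as x:

coef(cv.lasso, s = "lambda.1se")                          # penalized coefficients
pred <- predict(cv.lasso, newx = newx, s = "lambda.1se")  # predicted y for the new rows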

EdM
  • Sir, so should I use the coefficient values that lasso gives to predict y? And use the predict command, e.g. pred = predict(cv.lasso, newx)? Using the coef(cv.lasso) command, how do I get predicted y values for my 10 observations? – Kiran Prajapati Mar 26 '17 at 16:27
  • Can you give me the predict command for lasso? It would help a lot. – Kiran Prajapati Mar 26 '17 at 16:29
  • Note that people do perform OLS after lasso; see e.g. [this](http://stats.stackexchange.com/questions/253641/how-does-it-make-sense-to-do-ols-after-lasso-variable-selection/253664#253664) and [this](http://stats.stackexchange.com/questions/213077/lasso-for-cherry-picking/213550#213550). – Richard Hardy Mar 26 '17 at 16:45
  • @RichardHardy, okay sir, so can I use OLS after lasso here or not? – Kiran Prajapati Mar 26 '17 at 17:23
  • Although people do perform OLS after LASSO, as @RichardHardy notes, best practice then requires a detailed multi-step process as in [this example](http://stats.stackexchange.com/a/253664/28500) or adjustment for pre-selection of variables. For this application I don't see any advantage to going beyond the `glmnet` LASSO results, for which the `coef()` and `predict()` functions provide tools for prediction on new data. Section 6.6 of [ISLR](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf) has examples of predictions with glmnet ridge and LASSO results. – EdM Mar 26 '17 at 18:04
  • @KiranPrajapati, pure lasso could be fine for prediction. You may read the abstract and the introduction of the Belloni et al. (2014) paper in the first link I provided to see when their method should be used. – Richard Hardy Mar 26 '17 at 18:06
  • @EdM , thank you very much. This book is really helpful. – Kiran Prajapati Mar 26 '17 at 19:54
  • @EdM Could you elaborate on the **good** in `putting them into a new linear regression undoes the good that you did by choosing a principled approach like LASSO in the first place`? – SIslam Apr 01 '17 at 10:07
  • @SIslam hope my additions help. – EdM Apr 01 '17 at 16:22