
I am trying to use the LASSO technique to identify which variables to include in my model. I used cross-validation to identify the value of lambda that minimizes the CV error. For this value of lambda, I get the list of variables with non-zero coefficients.

I then ran a logistic regression including the variables with non-zero coefficients that I got from the LASSO output. However, when I view the model summary, many of the variables are highly insignificant (as per the $p$-values of their coefficients).

Is that expected? Or am I doing something wrong here?

I am using the glmnet package in R.
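
For concreteness, here is a minimal sketch of the two-step workflow described above, using `cv.glmnet`. The data (`x`, `y`) and object names are hypothetical placeholders of my own; this only illustrates the procedure being asked about, not an endorsement of it.

```r
library(glmnet)

## Simulated stand-in data: x is a numeric predictor matrix, y a binary outcome
set.seed(1)
n <- 200; p <- 20
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("x", 1:p)
y <- rbinom(n, 1, plogis(x[, 1] - 0.5 * x[, 2]))

## Step 1: cross-validated lasso logistic regression
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

## Coefficients at the lambda that minimizes the CV error
coefs    <- coef(cv_fit, s = "lambda.min")
selected <- setdiff(rownames(coefs)[as.vector(coefs) != 0], "(Intercept)")

## Step 2: refit an unpenalized logistic regression on the selected variables
## (this is the step the comments below argue against)
refit_df <- data.frame(y = y, x[, selected, drop = FALSE])
refit    <- glm(y ~ ., data = refit_df, family = binomial)
summary(refit)   # these p-values do not account for the selection step
```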

  • Maybe just the logistic regression isn't the right model for your dataset. And also I wouldn't worry much about p-values as long as the test error is reasonable. – Alexey Grigorev May 25 '15 at 11:29
  • 1
    There's no need to run the model again after doing cross-validation (you just get the coefficients from the output of cv.glmnet), and in fact if you fit the new logistic regression model without penalisation then you're defeating the purpose of using lasso. Having many of the variables non-significant is neither expected nor a sign of something wrong. – mark999 May 25 '15 at 11:47
  • As @mark999 stated this approach is not statistically valid. You can't pretend that estimates can be unpenalized when you originally used penalized maximum likelihood estimation. The problem with penalized estimates is that you can't get statistical inference unless you are Bayesian. – Frank Harrell May 25 '15 at 13:05
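
For reference, the approach mark999 describes (no second, unpenalized model) amounts to reading the penalized coefficients and predictions straight off the `cv.glmnet` fit. Continuing with the hypothetical `cv_fit` and `x` objects from the sketch above:

```r
## Penalized (shrunken) coefficients at the CV-chosen lambda
coef(cv_fit, s = "lambda.min")

## Predicted probabilities from the penalized model itself
predict(cv_fit, newx = x, s = "lambda.min", type = "response")
```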
