I'm doing post-Lasso OLS, i.e. running a Lasso regression on some data and then running an OLS regression using only the variables that had non-zero coefficients in the Lasso fit.

Should I care about statistical significance in the OLS stage? If so, how? Using the standard threshold of statistical significance feels a bit conservative to me, given that the variable was already shortlisted in the first stage by Lasso. To give a concrete example, I feel uncomfortable saying that there is not enough evidence to conclude there is a correlation simply because a coefficient has a t-statistic of (say) 1.5.

wwl

1 Answer

The naive statistical inference (p-values etc.) from the OLS stage of post-Lasso is going to be invalid.

In general, your standard errors are going to be too small, since the OLS stage disregards the uncertainty from the model selection stage. Valid post-Lasso inference is quite involved, and I am not confident that I could explain all the details.
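To make the point concrete, here is a minimal simulation sketch in plain Python. Everything in it is my own assumption for illustration: the response is pure noise (unrelated to every predictor), the Lasso penalty is fixed at `lam=0.2` rather than cross-validated, there is no intercept, and the helper names (`lasso_cd`, `ols_t_stats`, `simulate`) are hypothetical, with a hand-rolled coordinate-descent Lasso standing in for a real library. It counts how often the naive OLS t-statistic of a Lasso-selected variable exceeds 2 in absolute value.

```python
import math
import random


def soft_threshold(z, lam):
    """Shrink z towards zero by lam (the Lasso proximal step)."""
    return math.copysign(max(abs(z) - lam, 0.0), z)


def lasso_cd(X, y, lam, sweeps=50, tol=1e-8):
    """Coordinate descent for (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    scale = [sum(v * v for v in c) / n for c in cols]
    b = [0.0] * p
    r = list(y)  # current residual y - Xb
    for _ in range(sweeps):
        max_change = 0.0
        for j, c in enumerate(cols):
            rho = sum(cv * rv for cv, rv in zip(c, r)) / n + scale[j] * b[j]
            new = soft_threshold(rho, lam) / scale[j]
            delta = new - b[j]
            if delta != 0.0:
                r = [rv - delta * cv for rv, cv in zip(r, c)]
                b[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return b


def ols_t_stats(X, y):
    """Naive OLS t-statistics (no intercept) via Gauss-Jordan on X'X."""
    n, k = len(X), len(X[0])
    # Augmented matrix [X'X | I]; row-reducing it yields [I | (X'X)^-1].
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         + [1.0 if a == b else 0.0 for b in range(k)] for a in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda row: abs(A[row][c]))
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [v / d for v in A[c]]
        for row in range(k):
            if row != c:
                f = A[row][c]
                A[row] = [v - f * w for v, w in zip(A[row], A[c])]
    inv = [row[k:] for row in A]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    rss = sum((y[i] - sum(X[i][a] * beta[a] for a in range(k))) ** 2
              for i in range(n))
    sigma2 = rss / (n - k)
    return [beta[a] / math.sqrt(sigma2 * inv[a][a]) for a in range(k)]


def simulate(reps=300, n=40, p=8, lam=0.2, crit=2.0, seed=0):
    """Count selected coefficients and how many have naive |t| > crit."""
    rng = random.Random(seed)
    selected = rejected = 0
    for _ in range(reps):
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]  # y is independent of X
        b = lasso_cd(X, y, lam)
        keep = [j for j in range(p) if b[j] != 0.0]
        if not keep:
            continue  # Lasso selected nothing in this replication
        Xs = [[row[j] for j in keep] for row in X]
        t = ols_t_stats(Xs, y)
        selected += len(t)
        rejected += sum(abs(v) > crit for v in t)
    return selected, rejected


selected, rejected = simulate()
print(f"selected coefficients: {selected}")
print(f"share with naive |t| > 2: {rejected / selected:.2f} (nominal is about 0.05)")
```

Since no predictor has any true effect here, a valid test at the usual threshold should flag only about 5% of coefficients. If the printed share is well above that, it is the selection step inflating the t-statistics, which is exactly what the naive OLS standard errors fail to account for.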

For applied work, you might want to take a look at the HDCI package for R and its function LassoOLS (see ?LassoOLS).

I hope this helps.

PS. Here's a related question: "How does it make sense to do OLS after LASSO variable selection?"

Otto Kässi