Post-Lasso OLS: What does statistical significance mean?

Question

I'm doing Post-Lasso OLS, i.e. running a Lasso regression on some data and then running an OLS regression using only the variables that had non-zero coefficients in the Lasso fit.

Should I care about statistical significance in the OLS stage? If so, how? Using the standard threshold of statistical significance feels a bit conservative to me, given that the variable was already shortlisted in the first stage by Lasso. To give a concrete example, I feel uncomfortable saying that there is not enough evidence to conclude there is a correlation simply because a coefficient has a t-statistic of (say) 1.5.

score 1 · Answer 1 · answered Mar 03 '20 at 08:32

The usual statistical inference (p-values etc.) from the OLS fit after lasso is going to be invalid.

In general, your standard errors are going to be too small, because the OLS stage treats the selected model as if it were fixed in advance and disregards the uncertainty in the model selection phase. Valid post-lasso inference is quite involved, and I am not confident I could explain all the details.
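To make the selection effect concrete, here is a minimal simulation sketch under simplifying assumptions of my own (they are not part of the question): an orthonormal design, a known noise standard deviation of 1, and true coefficients that are all zero. In that special case the lasso keeps a variable exactly when the absolute value of its OLS z-statistic exceeds a threshold set by the penalty, and the post-lasso OLS z-statistic is that same number. The threshold of 1.0 and the function name `simulate` are illustrative choices.

```python
import random


def simulate(n_vars=200_000, lasso_threshold=1.0, crit=1.96, seed=0):
    """Naive 5% tests applied only to variables that survived selection.

    Assumptions (illustrative): orthonormal design, known noise sd of 1,
    every true coefficient equal to zero.
    """
    rng = random.Random(seed)
    selected = 0
    rejected = 0
    for _ in range(n_vars):
        # z-statistic of a variable whose true coefficient is zero
        z = rng.gauss(0.0, 1.0)
        # With an orthonormal design the lasso is a soft-threshold rule,
        # so the variable is kept only if |z| exceeds the threshold.
        if abs(z) > lasso_threshold:
            selected += 1
            # Post-selection OLS reproduces the same z, tested naively.
            if abs(z) > crit:
                rejected += 1
    return selected, rejected / selected


selected, naive_rate = simulate()
print(f"selected null variables: {selected}")
print(f"naive rejection rate among them: {naive_rate:.3f}")
```

Among the selected variables the naive test rejects far more often than the nominal 5%: in theory about 0.05 / 0.317, or roughly 16%, for this threshold. So a t-statistic taken at face value after selection overstates the evidence rather than understating it.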

For applied work, you might want to take a look at the HDCI package, and its function ?LassoOLS.

I hope this helps.

PS. Here's a related question: How does it make sense to do OLS after LASSO variable selection?
