I was wondering about the ethics of using lasso regression for variable selection and then simply entering the selected variables into a standard regression.

Is it kosher to do this?

llewmills
  • probably not because Lasso has an additional L1 constraint compared to the "standard" regression. Both will produce different estimates – user3119750 Oct 19 '17 at 19:30
  • Yes I suspected as much but was interested in why. I just wondered whether the shrinkage towards 0 in a lasso, in its own way, produces another form of bias. I guess I'm interested in which estimates are more 'accurate' (acknowledging the difficulty with notions like accuracy in statistics) – llewmills Oct 19 '17 at 19:42
  • What's your goal with the analysis? Why do the LASSO estimates not meet this goal? – Matthew Drury Oct 19 '17 at 20:07
  • @Matthew Drury my goal is to select the model, from a list of 17 predictors, that best predicts the outcome variable. My boss chose the variables, from a much larger number, that were the most theoretically sound; however, 17 seems like too large a number to appear like anything other than fishing. If I am going to fish, I want to do it ethically (i.e. no stepwise regression). But my boss wants p-values and the glmnet package in R doesn't seem to supply any. I wondered if it was ok to use glmnet to supply me with variables, then lm() to run the analysis. I guessed not but wanted to check. – llewmills Oct 19 '17 at 21:39
  • Do you know what your boss intends to do with the p-values? If your goal is predictive power, p-values have nothing to say on that issue, so your boss's request is not in line with the business problem. Here are some simple examples to make that point: https://stats.stackexchange.com/questions/291210/is-it-wrong-to-choose-features-based-on-p-value – Matthew Drury Oct 19 '17 at 23:52
  • There are ways to estimate the variance of the parameter estimates from a lasso model. If that is the goal your boss is after, it can certainly be done, but p-values are not the answer. – Matthew Drury Oct 19 '17 at 23:55
  • I think I have seen a similar question (or a few of them) before. Isn't this a duplicate? – Richard Hardy Oct 20 '17 at 06:23
  • @Matthew Drury if it were up to me I would drop p-values altogether and just use Lasso; however, in Psychology and the health sciences people tend to only engage with a coefficient if it has a p-value next to it. It's a stat that medical doctors, who have little training, can understand. Like a gold star. – llewmills Oct 23 '17 at 06:36
  • It's quite debatable whether medical doctors understand p-values. : ) – Matthew Drury Oct 23 '17 at 19:06
  • Yes quite @MatthewDrury. I was referring to the more gifted among them. – llewmills Oct 23 '17 at 19:07

1 Answer

This is not kosher, but if you do it anyway, I won't tell anyone.

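This is frowned upon because you are performing model selection (that's the second S in LASSO), and in model selection you reuse the same data, first to choose the model and then to estimate and test it. The p-values from the second-stage regression are computed as if the set of variables had been fixed in advance, so they ignore the selection step and tend to come out too small. I'm hoping someone else can give you a better mathematical explanation, because I don't think I can. Put simply, you are ignoring the conditioning on the selection event that got you to $\hat\beta^{LASSO}$.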
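If a simulation helps more than an argument, here is a minimal sketch of the problem, under loudly stated assumptions. It is plain standard-library Python rather than glmnet and lm(), the outcome and all 17 predictors are pure noise, and a simple cutoff on the absolute correlation stands in for the lasso (with an orthonormal design, the lasso keeps exactly the predictors whose association with the outcome exceeds a threshold). The function names, the cutoff of 0.15, and the Fisher z approximation for the p-value are all choices made for illustration, not anything from the question.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def naive_p_value(r, n):
    """Two-sided p-value for 'no association', using the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n=100, p=17, n_sims=300, threshold=0.15, alpha=0.05, seed=1):
    """Return (false positive rate over all predictors,
    false positive rate among predictors that survived selection)."""
    rng = random.Random(seed)
    tested = rejected = selected = rejected_after_selection = 0
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(n)]  # outcome is pure noise
        for _ in range(p):
            x = [rng.gauss(0, 1) for _ in range(n)]  # predictor unrelated to y
            r = correlation(x, y)
            significant = naive_p_value(r, n) < alpha
            tested += 1
            rejected += significant
            if abs(r) > threshold:  # assumed stand-in for "the lasso kept it"
                selected += 1
                rejected_after_selection += significant
    return rejected / tested, rejected_after_selection / selected


overall_rate, post_selection_rate = simulate()
print(f"false positive rate, all predictors:      {overall_rate:.3f}")
print(f"false positive rate, selected predictors: {post_selection_rate:.3f}")
```

Every "significant" result here is a false positive by construction. Across all predictors the rate should sit near the nominal 5%, but among the predictors that survived selection it should come out several times higher, because only the variables that already looked associated with the outcome ever reach the test.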

Tim Atreides