I ran LASSO with logistic regression to obtain a list of "important" variables. For factor variables, I created one-hot encoded dummy variables using the step_dummy function in the tidymodels world.

After running LASSO, I inspected the list of variables that were kept and noticed that some of the dummy variables were deemed "unimportant" by LASSO, so their coefficients were set to 0. Does it make sense to keep only some of the dummy variables (i.e., the non-zero ones) when running a final logistic regression model? For example, for race, 5 indicator variables were created using one-hot encoding: White, Black, Asian, Hispanic, and Other. LASSO only deemed White and Hispanic important and dropped the other 3. Is it OK to include just White and Hispanic in my logistic regression to make predictions?

kjetil b halvorsen
user122514
  • What would you do with the data from the Black, Asian, and Other people? – Dave Aug 19 '21 at 15:55
  • I think the data from them is still usable. By not having dummy features for them, it essentially relegates them to a “none of the above” category and pushes them into the intercept term. (Whether this is ethical is another question requiring info outside of the posted question.) – Arya McCarthy Aug 19 '21 at 16:01
  • @Dave their data won't be dropped. Essentially, I would end up with a model equation like this: `logit(p) = intercept + 0.1*White + 0.2*Hispanic + other variables`. – user122514 Aug 19 '21 at 16:04
  • 1) What happened to the Asian category? 2) How did you come up with those $0.1$ and $0.2$ coefficients? 3) You don't appear to be listening to the LASSO telling you that the estimated coefficient on Black is zero. – Dave Aug 19 '21 at 16:06
  • Sorry, I just made up an example and forgot that White and Hispanic were the only important ones. The `0.1` and `0.2` are completely arbitrary and were only used to show you what I meant. – user122514 Aug 19 '21 at 16:07
  • 1) But how do you get those $0.1$ and $0.2$ coefficients? 2) Where does the "race" variable go for Black, Asian, and Other people? 3) LASSO for variable selection isn't even advisable; see [Frank Harrell's tweet](https://twitter.com/f2harrell/status/1165403177371480064?lang=en) about LASSO selecting the "right" variables. – Dave Aug 19 '21 at 16:08
  • 1) As stated, the coefficients were made up for illustration purposes. 2) LASSO zeroed out these coefficients. I am wondering if I have to include them in my "final" logistic model even though they were zeroed out by LASSO. – user122514 Aug 19 '21 at 16:11
  • If you do not include them in your model (see my comment edit about even using LASSO for variable selection), then what happens to the data from the people who are Black, Asian, and Other? – Dave Aug 19 '21 at 16:13
  • As pointed out by @AryaMcCarthy, their data are still usable. If the equation is `logit(p) = intercept + 0.1*White + 0.2*Hispanic + other variables`, then to calculate the logit for people who are Black/Asian/Other, we would have `logit(p) = intercept + 0.1*0 + 0.2*0 + other variables = intercept + other variables`. I'm not sure why you are saying I'm dropping their data. – user122514 Aug 19 '21 at 16:20
  • Then you're considering them the same race. Perhaps this is fine, but it probably is not, and it is based on a bit of a dubious variable selection process. – Dave Aug 19 '21 at 16:21
  • You should keep them! See https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding/329281#329281 and try *group lasso*; see https://stats.stackexchange.com/questions/214325/why-use-group-lasso-instead-of-lasso – kjetil b halvorsen Aug 20 '21 at 17:33
  • @kjetilbhalvorsen Thank you! I looked into group LASSO earlier and am in the middle of running it now! :-) – user122514 Aug 20 '21 at 17:56
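
To make the point debated in the comments concrete, here is a minimal Python sketch (not the asker's tidymodels code) of what happens when only the White and Hispanic dummies are kept. The `0.1` and `0.2` coefficients are the made-up values from the comments, the intercept of `-1.0` is an extra assumption, and the helper names `one_hot`, `logit`, and `probability` are hypothetical.

```python
import math

# Levels produced by one-hot encoding the race factor (taken from the question).
LEVELS = ["White", "Black", "Asian", "Hispanic", "Other"]

# Illustrative values only: 0.1 and 0.2 are the made-up coefficients from the
# comments; the intercept is an additional assumption for this sketch.
INTERCEPT = -1.0
KEPT = {"White": 0.1, "Hispanic": 0.2}  # dummies whose coefficients stayed non-zero


def one_hot(race):
    """Return a dict of 0/1 indicators, one per level."""
    return {level: int(level == race) for level in LEVELS}


def logit(race, other_terms=0.0):
    """Linear predictor that uses only the kept dummy variables."""
    x = one_hot(race)
    return INTERCEPT + sum(beta * x[name] for name, beta in KEPT.items()) + other_terms


def probability(race, other_terms=0.0):
    """Inverse logit of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-logit(race, other_terms)))


for race in LEVELS:
    print(f"{race:8s} logit={logit(race):+.2f} p={probability(race):.3f}")
```

With the other variables held fixed, Black, Asian, and Other all receive the same linear predictor, namely the intercept. No rows are dropped, but those three levels are merged into a single reference category, which is the "none of the above" behaviour described above.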