
Does it ever make sense to check for multicollinearity and perhaps remove highly correlated variables from your dataset prior to running LASSO regression to perform feature selection?

One of the scientists I am working with is highly concerned that by not dealing with multicollinearity before LASSO regression, the LASSO model will perform poorly, though I'm not sure what the general consensus is for this. I was thinking that because LASSO will shrink some coefficients to zero, multicollinearity is remedied. Any thoughts or suggestions?
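For concreteness, here is a minimal sketch of the behaviour I have in mind, using scikit-learn and a toy simulated dataset (everything below is purely illustrative, not our actual data): with two nearly duplicate predictors, a cross-validated LASSO typically concentrates the coefficient on one of the pair and drives the other to (or near) zero.

```python
# Minimal illustrative sketch (toy data, not the real dataset): simulate two
# highly correlated predictors that carry the same signal and see which ones
# a cross-validated LASSO keeps.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n = 500

x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # x2 is almost a copy of x1
x3 = rng.normal(size=n)                  # an independent, weaker predictor
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 0.5 * x3 + rng.normal(size=n)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
print("chosen alpha:", lasso.alpha_)
print("coefficients:", lasso.coef_)      # often one of x1/x2 ends up at or near zero
```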

asked by user122514 · edited by kjetil b halvorsen
  • Will you be using the LASSO to select features and then fit a regression on those features? // What is the purpose of the modeling, pure prediction? – Dave Jul 15 '21 at 15:34
  • Not to stir the pot, but if variables are highly collinear (and highly predictive), then LASSO will on average tend to select just one and discard the other with no preference. As far as a predictive routine goes, that seems just fine by my reckoning, i.e. given two predictive variables with the same information I don't care which one I use, just that I use one of them. – AdamO Jul 15 '21 at 15:45
  • @Dave Yes, our goal is to use LASSO to perform feature selection and ultimately run a logistic model using the coefficients selected by LASSO (as well as clinical variables we know are important). The model will be used for prediction purposes. – user122514 Jul 15 '21 at 15:48
  • 1
    So why not run the regularized regression on all of your variables and cross validate to find the hyperparameter giving the best performance? – Dave Jul 15 '21 at 15:49
  • @Dave, that was my initial thought process and what I presented to my colleague yesterday. However, using cross-validation to select the "best" lambda gave us a model with a little over 200 features. That's when she immediately felt concerned about LASSO keeping too many correlated features. – user122514 Jul 15 '21 at 15:57
  • Why does that matter if that's what gives the best predictive performance? // It sounds like she wants to use those 200 features in an OLS regression, using LASSO for feature selection. Is that accurate? – Dave Jul 15 '21 at 16:02
  • Because she wants to implement this model in the electronic health record, she wants to arrive at a more "condensed" model that performs "fairly well" (too idealistic, I know). Originally, we had over 300 variables, so LASSO dropped about 100 of them, which I thought was quite good already. However, we went back and forth about multicollinearity, and she asked me to remove correlated variables prior to LASSO, but I'm just not sure if it's sound to do so. – user122514 Jul 15 '21 at 16:08
  • If you have "too many" features -- perhaps because there are engineering requirements to only use a certain number of features -- you can just increase the penalty until the model only includes the desired number of features (a rough sketch of this appears after the thread). This might reduce the performance of the model "too much," but then you have to ask yourself what your goal is: to build a model that selects fewer than some number of features, or a model that hits some performance benchmark. Either could be valid, depending on your setting. – Sycorax Jul 15 '21 at 16:09
  • Granted, at the expense of some performance... – Dave Jul 15 '21 at 16:10
  • All these suggestions make sense. I'll definitely run them by her and see what she thinks. Thank you all! – user122514 Jul 15 '21 at 16:13
  • @AdamO: What you say disagrees with my answer at https://stats.stackexchange.com/questions/264016/why-lasso-or-elasticnet-perform-better-than-ridge-when-the-features-are-correlat/264118#264118 – kjetil b halvorsen Jul 15 '21 at 16:40
  • You could try elastic net (sketched after the thread as well), see https://stats.stackexchange.com/questions/264016/why-lasso-or-elasticnet-perform-better-than-ridge-when-the-features-are-correlat/264118#264118 – kjetil b halvorsen Jul 15 '21 at 16:41
  • @kjetilbhalvorsen I'm struggling, and may need to explore a little simulation to convince myself otherwise. The virtue of LASSO is that it enforces sparsity, so coefficients close to 0 are forced to be 0. The unspoken assumption here is that our collinear variables are both strongly predictive. Then it gets interesting. I figured that after controlling for the first variable, the *second* has relatively little predictiveness, so even then the LASSO penalty tends to exclude the second variable. – AdamO Jul 19 '21 at 16:50
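A rough sketch of the workflow Dave and Sycorax describe: cross-validate the L1 penalty for predictive performance, then, if that keeps "too many" features, tighten the penalty until no more than a target number survive. This assumes scikit-learn and placeholder data; the dataset, scoring metric, and feature limit below are illustrative only, not our actual setup.

```python
# Illustrative sketch only: logistic LASSO with a cross-validated penalty,
# then a penalty sweep to cap the number of selected features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Placeholder for the real (e.g. EHR) data.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           random_state=0)
X = StandardScaler().fit_transform(X)            # LASSO is scale-sensitive

# 1) Cross-validate the penalty for pure predictive performance.
cv_fit = LogisticRegressionCV(Cs=20, cv=5, penalty="l1", solver="saga",
                              scoring="roc_auc", max_iter=5000).fit(X, y)
print("CV-chosen C:", cv_fit.C_[0],
      "features kept:", int(np.sum(cv_fit.coef_ != 0)))

# 2) If that keeps "too many" features, tighten the penalty (smaller C)
#    until at most max_features remain, accepting some loss of performance.
max_features = 15
for C in np.logspace(0, -3, 30):
    fit = LogisticRegression(penalty="l1", C=C, solver="saga",
                             max_iter=5000).fit(X, y)
    if np.sum(fit.coef_ != 0) <= max_features:
        print("C =", C, "keeps", int(np.sum(fit.coef_ != 0)), "features")
        break
```

And a corresponding sketch of kjetil b halvorsen's elastic-net suggestion, again with scikit-learn and placeholder data (the l1_ratio grid is arbitrary). The mixed L1/L2 penalty tends to keep or drop groups of correlated predictors together rather than arbitrarily picking one of them.

```python
# Illustrative sketch only: elastic-net logistic regression with the penalty
# strength and L1/L2 mix chosen by cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           random_state=0)        # placeholder data
X = StandardScaler().fit_transform(X)

enet = LogisticRegressionCV(Cs=10, cv=5, penalty="elasticnet", solver="saga",
                            l1_ratios=[0.2, 0.5, 0.8],   # arbitrary grid of L1/L2 mixes
                            scoring="roc_auc", max_iter=5000).fit(X, y)
print("chosen C:", enet.C_[0], "chosen l1_ratio:", enet.l1_ratio_[0],
      "features kept:", int(np.sum(enet.coef_ != 0)))
```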
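Whether the plain LASSO or the elastic net is preferable here depends on whether an arbitrary choice among correlated predictors is acceptable for the intended prediction task, which is exactly the point of disagreement in the thread above.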

0 Answers