
I'm working on building a prediction model. I used group LASSO to perform variable selection and ended up with a model that performs quite well. However, the model still has about 100 inputs, and my collaborator wants to see whether we can trim it further; they suggested tossing out the variables that aren't statistically significant (p > 0.05).

If I decide to exclude these variables, do I have to refit the model from scratch on the smaller list of variables? Or does it make sense to simply keep the old model and delete the coefficients associated with the insignificant variables? I'm worried that refitting will introduce bias, since we have already seen the data and the results. Any help or advice would be greatly appreciated.

user122514
  • It does, but not entirely. Our goal is a high-performing model, which I built by first splitting the dataset into training and test sets and evaluating performance on the test set. However, my collaborator is asking whether we can cut down the number of variables and still have a reasonably good model. My question is: if I do remove some variables, do I have to go through the training and testing process again? – user122514 Oct 14 '21 at 21:45
  • Yes, you have to retrain/test the model. Variable selection is part of the model, so you have to validate the whole procedure. If you do perform variable selection, I suggest you use a nested cross-validation strategy. – Demetri Pananos Oct 14 '21 at 22:37
  • Our current workflow looks like this: use group LASSO for variable selection, keep variables selected by LASSO, fit model with variables selected by LASSO using regular logistic regression, and evaluate model performance on test dataset. If I retrain/test the model, would I have to perform LASSO on the smaller subset of variables (after tossing out the insignificant ones)? – user122514 Oct 14 '21 at 22:43
  • Yes, you do have to do it all over again. Worse, you should eliminate variables one at a time: discard the least contributory one first, refit, and only then determine which remaining parameter is now the least contributory before eliminating it. Alternatively, you can use [PCA](https://builtin.com/data-science/step-step-explanation-principal-component-analysis). – Carl Oct 24 '21 at 08:39
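The nested cross-validation suggested in the comments above might look like the following sketch. This is illustrative only: scikit-learn has no built-in group LASSO, so plain L1-penalized logistic regression stands in for it, and the data are synthetic.

```python
# Nested CV: the inner loop tunes the penalty, the outer loop scores the
# entire procedure (tuning included) on held-out folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Inner loop: choose the penalty strength C on each training fold.
inner = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    param_grid={"C": np.logspace(-2, 2, 9)},
    cv=5,
    scoring="roc_auc",
)

# Outer loop: an honest estimate of how the *whole* selection-plus-fitting
# procedure generalizes, not just the final fitted model.
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print("nested CV AUC: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```

The point of the outer loop is that the penalty tuning (and hence the variable selection) is repeated inside every outer fold, so the selection step is validated along with the fit.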

1 Answer


As you are using a flavor of LASSO, you probably already have the answer to your collaborator's question from your model building.

You had to choose a penalty factor for LASSO. Presumably that was done by some form of cross-validation or train/test evaluation, finding the penalty factor that gave the best performance on your metric of interest.

So look back at those results to see what happened to your metric of interest when the penalty factor was increased far enough to get down to the desired number of predictors. That's the most principled way to proceed, and it shouldn't require any refitting at all if you kept the results that led to the initial penalty choice. The more parsimonious model will almost certainly perform worse; the question is how much worse, and whether the added parsimony is worth that cost.
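That lookup amounts to scanning the cross-validation grid for the penalty value that keeps the desired number of predictors and reading off the performance there. A minimal sketch (again with plain L1 logistic regression standing in for group LASSO, on synthetic data):

```python
# Trace how CV performance degrades as the penalty shrinks the model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Smaller C = stronger penalty = fewer predictors retained.
for C in np.logspace(-2, 1, 7):
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    n_kept = np.count_nonzero(model.fit(X, y).coef_)
    print(f"C={C:.3g}  predictors kept={n_kept}  CV AUC={auc:.3f}")
```

Reading down this table shows directly what the answer describes: how many predictors each penalty level keeps, and how much performance you give up to get there.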

You shouldn't be citing p-values for "significance" and the like with LASSO unless you are very careful; see *Statistical Learning with Sparsity* (Hastie, Tibshirani & Wainwright), Chapter 6. You certainly shouldn't just select predictors via LASSO, use them in an unpenalized regression, and accept those p-values, as that ignores the fact that you used the data to identify the predictors in the first place.

EdM
  • Thanks, your comment about the penalty factor makes a lot of sense. Because we plan to implement our model in an electronic health record system that does not support LASSO, we have been pushed to use LASSO for variable selection and then fit the "important" variables with regular logistic regression. I did find a couple of references saying this is sensible. What are your thoughts? – user122514 Oct 14 '21 at 22:14
  • @user122514 just use the penalized regression coefficients that are returned by LASSO at whatever penalty factor you end up with. LASSO returns penalized regression coefficients that minimize overfitting. If you select those predictors and go back to a new logistic regression, you risk poor performance on new data outside your original sample. See [this answer](https://stats.stackexchange.com/a/269952/28500) and the link in its comments. – EdM Oct 15 '21 at 02:19
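Deploying the penalized coefficients themselves, as the last comment suggests, only requires that the target system compute a linear score and a sigmoid. A sketch of what that export might look like (plain L1 logistic regression standing in for group LASSO, on synthetic data):

```python
# A fitted L1-penalized logistic model reduces to intercept + dot product
# + sigmoid, so its coefficients can be exported to a system (e.g. an EHR)
# that cannot run LASSO itself.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)

# Export exactly what a downstream system needs.
intercept = model.intercept_[0]
coefs = model.coef_[0]          # zeros mark the predictors LASSO dropped

def predict_proba(x):
    """Reproduce the model's prediction with nothing but arithmetic."""
    z = intercept + np.dot(coefs, x)
    return 1.0 / (1.0 + np.exp(-z))

# The exported arithmetic matches sklearn's own prediction.
assert np.isclose(predict_proba(X[0]), model.predict_proba(X[:1])[0, 1])
```

Because the penalized coefficients are used as-is, there is no second, unpenalized fitting step to undo the shrinkage that LASSO applied to control overfitting.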