Relative to the number of predictors you're considering, what you have is a very small sample, not one of "moderate size." An adequately sized dataset for logistic regression typically needs on the order of 10-20 cases in the smaller outcome class per candidate predictor, unless you use penalization. With about 10 predictors that means roughly 100-200 cases in the smaller class; with 15-20 cases per group you have only about 10% of that. So it's not surprising that you ran into perfect separation and had to move to the penalized Firth approach just to get coefficients at all.
One way to get around the problem would be to use your knowledge of the subject matter to combine the various similar datasets noted in your comment. If the samples are from the same population but some datasets are missing certain predictors present in other datasets, you could consider multiple imputation as a way to combine all the information you have.
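If you take that route, the mice package is one way to pool information across datasets that measured different subsets of predictors. A rough sketch, assuming the datasets have been stacked row-wise into a single data frame `combined`, with NA wherever a dataset lacked a predictor (all names here are placeholders, not your actual variables):

```r
library(mice)

## `combined` stacks the similar datasets row-wise, with NA where a given
## dataset did not measure a predictor; `outcome`, `x1`, `x2` are placeholders.
imp <- mice(combined, m = 20, seed = 101)   # 20 multiply imputed datasets

## Fit the logistic model on each imputed dataset and pool with Rubin's rules.
fits <- with(imp, glm(outcome ~ x1 + x2, family = binomial))
summary(pool(fits))
```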
Automated methods for model selection have many problems. Best-subset selection, even with cross-validation to minimize overfitting (as provided by the bestglm package that you linked), will depend heavily on the particular data sample at hand. With typical data sets containing correlated predictors, you will find that the identities of the "best subset" members change from fold to fold of CV, or among models fit to bootstrap samples of the data. So there won't be a single "best model"; the best you can do is show that the modeling process provides a result that is reasonably likely to meet your requirements.
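You can see this instability with your own data by rerunning the selection on bootstrap resamples and tabulating how often each predictor is chosen. A rough sketch, assuming a data frame `dat` with the candidate predictors followed by the 0/1 outcome as its last column, the layout bestglm expects (I use BIC here rather than CV because I'm not certain bestglm's CV option covers binomial models):

```r
library(bestglm)

set.seed(202)
selected <- replicate(200, {
  boot_dat <- dat[sample(nrow(dat), replace = TRUE), ]       # bootstrap resample
  fit <- bestglm(boot_dat, family = binomial, IC = "BIC")    # exhaustive search
  names(coef(fit$BestModel))[-1]                             # predictors kept
}, simplify = FALSE)

## How often each predictor makes it into the "best" subset:
sort(table(unlist(selected)), decreasing = TRUE) / 200
```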
So the answer depends on the requirements you have for this modeling. If you are interested just in prediction, you could consider the different type of penalization provided by ridge regression instead of the Firth approach. Unlike the Firth penalty, the magnitude of the ridge penalty can be chosen by cross-validation to minimize deviance. That keeps all the predictors in the model but shrinks their coefficients to minimize overfitting.
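With the glmnet package, for example, you might choose the ridge penalty by cross-validated deviance. A minimal sketch, assuming a numeric predictor matrix `x` and a 0/1 outcome vector `y` (hypothetical names):

```r
library(glmnet)

## alpha = 0 gives the ridge penalty; lambda is chosen by cross-validation
## to minimize the binomial deviance.
set.seed(303)
cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0,
                      type.measure = "deviance", nfolds = 10)

coef(cv_ridge, s = "lambda.min")    # all predictors kept, coefficients shrunken
predict(cv_ridge, newx = x, s = "lambda.min", type = "response")
```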
If you need to cut down on the number of predictors for some reason, LASSO might work, returning just a few predictors, but I suspect that with these small datasets you will come up against perfect separation again. Even if it works, recognize (as with best-subset selection) that LASSO's choice from among correlated predictors is somewhat arbitrary. Backward stepwise selection is not necessarily a bad way to go. It's probably the least objectionable of the stepwise approaches, and I know professional statisticians who use it regularly. Again, keep in mind that the particular subset retained might be very sample-dependent.
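Sketches of both options, reusing the hypothetical `x`, `y`, and `dat` from above; `outcome` is a placeholder column name:

```r
library(glmnet)
library(MASS)

## LASSO: alpha = 1; predictors whose coefficients are shrunk exactly to zero
## are effectively dropped from the model.
set.seed(404)
cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                      type.measure = "deviance", nfolds = 10)
coef(cv_lasso, s = "lambda.min")

## Backward stepwise selection by AIC from the full (unpenalized) model;
## expect warnings if separation reappears.
full <- glm(outcome ~ ., data = dat, family = binomial)
step_fit <- stepAIC(full, direction = "backward", trace = FALSE)
```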
Whichever approach you choose, do try to estimate the quality of your modeling process by repeating all the steps (including the predictor-selection algorithm) on multiple bootstrap samples of the data and testing the resulting models' performance on the full original data set. Even though the particular predictors retained by predictor-selection models will differ among bootstraps, you can at least document that the model-development process is reasonably effective.
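The rms package wraps up exactly that kind of loop for logistic regression with backward selection. A minimal sketch, again assuming a data frame `dat` with a 0/1 outcome named `outcome` and placeholder predictor names:

```r
library(rms)

## List all of your candidate predictors in the formula; x = TRUE, y = TRUE
## store the data needed for bootstrap validation.
fit <- lrm(outcome ~ x1 + x2 + x3, data = dat, x = TRUE, y = TRUE)

## validate() refits the model on each bootstrap sample, redoing backward
## selection each time (bw = TRUE), then evaluates on the original data to
## report optimism-corrected indices such as Dxy and the calibration slope.
set.seed(505)
validate(fit, method = "boot", B = 200, bw = TRUE)
```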