
If I am dealing with a small sample size (n = 48: 29 with disease vs 19 without), what is the maximum number of predictors I can include in my multivariable logistic regression model? (I am building a predictive model.)

There are so many rules of thumb online, and I am not sure which one I should use.

Any help is much appreciated!

R Beginner
  • See https://stats.stackexchange.com/questions/502150/on-the-existence-of-rule-of-thumb-for-machine-learning-algorithms/502158#502158 – kjetil b halvorsen Nov 24 '21 at 19:38
  • 3
    If you are building a *predictive* model then you can probably include as many variables as you think are relevant and use regularisation (L1 or L2 or both) to avoid over-fitting (don't do feature selection, it is more likely to make things worse rather than better if you have regularization) – Dikran Marsupial Nov 24 '21 at 19:42
  • @DikranMarsupial Can you please share R code for the L1 or L2 method (see the sketch after these comments)? An example would be much appreciated – R Beginner Nov 24 '21 at 20:36
  • 1
    @RBeginner unfortunately I don't use R, but any good implementation of logistic regression ought to provide L2 regularisation (c.f. ridge regression). There are undoubtedly packages for L1 regularisation (look for "LASSO") and both L1 and L2 regularisation (look for "elastic net"). – Dikran Marsupial Nov 25 '21 at 07:38
  • @DikranMarsupial No worries! So, in your opinion, we don't have to worry about the p-value for inclusion in my predictive model (e.g. only including predictors with p < 0.05 in my final model)? – R Beginner Nov 25 '21 at 18:29
  • 1
    @RBeginner Not if you are using regularisation, if the attribute is not doing anything useful, it will probably have a very small weight and the model is essentially ignoring it anyway. L1 regularisation may set it exactly to zero. – Dikran Marsupial Nov 25 '21 at 19:23
  • @DikranMarsupial So essentially we can include every predictor first and run the penalised lasso analysis, and then just remove whichever predictors have a coefficient of zero to arrive at the final predictive model, correct? – R Beginner Nov 25 '21 at 22:47
  • 1
    @RBeginner yes, however that is just if you want a predictive model. Personally I prefer L2 regression and just keep all of the parameters in the model as it simplifies setting the regularisation parameter (there are no special values where an attribute joins or leaves the model) – Dikran Marsupial Nov 26 '21 at 07:27
  • @DikranMarsupial If you choose L2 regularisation, will you then leave the parameters with very small coefficients in the final predictive model? – R Beginner Nov 26 '21 at 14:11
  • 1
    @RBeginner yes, they have small weights so they are doing no real harm. – Dikran Marsupial Nov 26 '21 at 14:21
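A minimal sketch of the L1/L2 fits discussed in these comments, using the `glmnet` package; the simulated data and object names are illustrative stand-ins, not from the thread:

```r
library(glmnet)

## Simulated stand-in for the real data: 48 participants, 10 candidate
## predictors, and a binary disease status (29 with disease, 19 without).
set.seed(1)
x <- matrix(rnorm(48 * 10), nrow = 48,
            dimnames = list(NULL, paste0("pred", 1:10)))
y <- c(rep(1, 29), rep(0, 19))

## L1 (lasso) penalised logistic regression: coefficients can shrink to exactly zero.
fit_l1 <- glmnet(x, y, family = "binomial", alpha = 1)

## L2 (ridge) penalised logistic regression: all coefficients kept, just shrunken.
fit_l2 <- glmnet(x, y, family = "binomial", alpha = 0)

## Coefficients at one (arbitrary) value of the penalty, for comparison.
coef(fit_l1, s = 0.1)   # lasso: many entries are exactly 0
coef(fit_l2, s = 0.1)   # ridge: entries are nonzero but small
```

Setting `alpha` strictly between 0 and 1 gives the elastic-net mixture of the two penalties mentioned above.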

1 Answer


A useful rule of thumb for logistic regression is to limit yourself to about 1 unpenalized predictor per 15 cases of the minority class. See Section 4.4 of Frank Harrell's course notes, for example. That's when you have a "typical problem in medicine, epidemiology, and the social sciences in which the signal:noise ratio is small."

See, for example, this page linked in a comment from kjetil b halvorsen, and this page. With only 19 cases in your minority class, that rule works out to roughly one unpenalized predictor. If your signal:noise ratio is higher, you can get away with fewer cases per predictor.

I highlighted the word "unpenalized" above because you don't have to throw out all except 1 or 2 of your predictors. A penalized method ("regularization" mentioned in one of the comments) allows you to use more predictors than that rule of thumb.

The regression coefficients of the predictors are penalized to lower magnitudes than they would be in a standard regression, to help avoid overfitting. The penalty that provides best performance is typically chosen by cross-validation. Ridge regression ("L2 regularization") provides coefficients for all predictors. LASSO ("L1 regularization") provides penalized coefficients for some predictors and sets coefficients of others to 0. My guess is that you would be better served by ridge regression here, perhaps after you apply your knowledge of the subject matter to reduce the effective number of predictors. See Harrell's notes for ideas on how to implement data reduction, to cut down on the number of predictors without using the outcomes.

For logistic regression, penalization is implemented in the glmnet package.
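As a rough sketch of how that might look for ridge logistic regression (simulated placeholder data and made-up object names, not the asker's dataset), with the penalty chosen by cross-validation via `cv.glmnet()`:

```r
library(glmnet)

## Simulated placeholder data: 48 rows, 5 candidate predictors, 29/19 outcome split.
set.seed(1)
x <- matrix(rnorm(48 * 5), nrow = 48,
            dimnames = list(NULL, paste0("pred", 1:5)))
y <- c(rep(1, 29), rep(0, 19))

## Ridge (alpha = 0) logistic regression; cv.glmnet() chooses the penalty (lambda)
## by cross-validation. Note that glmnet takes a predictor matrix and an outcome
## vector, rather than the formula + data-frame interface of glm().
cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)

## Penalized coefficients at the cross-validated penalty.
coef(cv_ridge, s = "lambda.min")

## Predicted probabilities for the same (or new) predictor matrix.
predict(cv_ridge, newx = x, s = "lambda.min", type = "response")
```

With a sample this small, leave-one-out cross-validation (`nfolds = nrow(x)`) is sometimes preferred; with the default 10-fold cross-validation, the random fold assignment can noticeably affect the chosen penalty.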

EdM
  • Thank you for the great contribution. So, if I decide to use an unpenalized approach, I can roughly fit only 2 predictors in my data (by "cases" you mean participants with the disease, correct)? – R Beginner Nov 24 '21 at 20:34
  • @RBeginner that's correct. The limiting number is the size of the minority class regardless of how you name the classes, so in your case it's limited by the 19 without disease. Thus even 2 predictors would be pushing it and could overfit. I recommend that you try ridge regression with `glmnet()`, illustrated in Section 6.6.1 of [ISLR](https://www.statlearning.com/s/ISLRSeventhPrinting.pdf) for ordinary least squares but directly applicable to your logistic regression with a `family="binomial"` argument. – EdM Nov 24 '21 at 21:49
  • Does the minority class mean the group with the smaller sample size? – R Beginner Nov 25 '21 at 01:24
  • Can you please share R code using `glmnet()`? Is it similar to the usual logistic regression call (e.g. `glm()`)? – R Beginner Nov 25 '21 at 01:25
  • @RBeginner the minority group is the one with the smaller sample size. ISLR in my prior comment is freely available and the cited section 6.6.1 shows how to do ridge regression with `glmnet()`, including use of `cv.glmnet()` to find the optimum penalty. You specify the argument `alpha = 0` to do ridge regression. If you also use a `family="binomial"` argument it will do logistic ridge regression. Unlike other R functions, you have to provide separate outcome and predictor vectors/matrices, however. – EdM Nov 25 '21 at 02:48
  • I saw online that ridge regression is an extension of linear regression; is this correct? If so, does it mean it actually does not apply to my logistic regression model? – R Beginner Nov 25 '21 at 17:36
  • 1
    @RBeginner that's not correct, if by "linear regression" you mean ridge regression is restricted to an extension of ordinary least squares. Ridge regression can be used in all types of generalized linear models, including binomial/logistic regression. – EdM Nov 25 '21 at 17:40
  • Thanks for the clarification. By the way, can we also compute odds ratios from the ridge/lasso regression, just as with ordinary logistic regression? – R Beginner Nov 25 '21 at 18:16
  • 1
    @RBeginner you still get regression coefficients, it's just that their magnitudes have been reduced to avoid overfitting. – EdM Nov 25 '21 at 18:41
  • Got it! So lasso will automatically remove predictors that are not contributing significantly to the model while ridge will just reduce the magnitudes of the regression coefficients (but keep everything as it is)? – R Beginner Nov 25 '21 at 18:45
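To make the odds-ratio point concrete, a small sketch (again with simulated placeholder data, so the numbers themselves are meaningless): the penalized coefficients are still on the log-odds scale, so exponentiating them gives penalized odds ratios, and predictors dropped by the lasso show up with coefficient 0, i.e. odds ratio 1.

```r
library(glmnet)

## Simulated placeholder data (not the real dataset).
set.seed(1)
x <- matrix(rnorm(48 * 5), nrow = 48,
            dimnames = list(NULL, paste0("pred", 1:5)))
y <- c(rep(1, 29), rep(0, 19))

cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)  # ridge: shrinks every coefficient
cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # lasso: may set some to exactly 0

## Coefficients are on the log-odds scale, as in glm(..., family = binomial),
## so exponentiating gives (penalized) odds ratios.
exp(as.matrix(coef(cv_ridge, s = "lambda.min")))
exp(as.matrix(coef(cv_lasso, s = "lambda.min")))  # zeroed coefficients appear as odds ratio 1
```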