
Dear Cross Validated community,

I am conducting a multiple logistic regression for a case-control study. There are more candidate predictors than can reasonably be investigated. My question is: can I choose which predictors to include in the regression model based on the clinical knowledge of the consultant PI on the research team, instead of selecting them based on univariate statistical significance?

I believe that selecting variables based on their statistical significance in univariate analyses is biased. What other options should be considered?

3 Answers


First, it's always important to start with knowledge of the subject matter.

Second, selection based on "univariate"* models can be particularly misleading with logistic regression, as it is prone to omitted-variable bias in a way that can reduce your chance of finding true associations between predictors and outcome. Unlike with least-squares regression, the omitted predictors don't even need to be associated with the included predictors for this bias to arise in logistic regression. The same holds for survival modeling. So it's better to avoid that selection process.

How to proceed beyond that depends on the purpose of the modeling.

If your primary interest is in prediction, consider a modeling approach that uses as many predictors as possible while avoiding overfitting. If, for example, you use ridge regression with cross-validation to choose the penalty by minimizing deviance (log-loss), there is no such thing as "too many predictors": all predictors are kept, with the magnitudes of their regression coefficients shrunk to avoid overfitting. Or you could use the elastic net to reduce the number of predictors somewhat while still avoiding overfitting.
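
A minimal sketch of the cross-validated ridge approach, assuming a scikit-learn (Python) workflow and hypothetical arrays `X` (predictor matrix) and `y` (0/1 case-control status):

```python
# A minimal sketch, assuming hypothetical X and y: ridge-penalized logistic
# regression with the penalty strength chosen by cross-validated log-loss.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(
    StandardScaler(),                # penalization assumes comparable scales
    LogisticRegressionCV(
        penalty="l2",                # ridge: shrinks coefficients, drops none
        Cs=20,                       # grid of candidate penalty strengths
        cv=10,
        scoring="neg_log_loss",      # choose the penalty by deviance
        max_iter=5000,
    ),
)
model.fit(X, y)
```

For the elastic-net variant, `penalty="elasticnet"` with `solver="saga"` and a grid of `l1_ratios` works similarly, zeroing out some coefficients along the way.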

If you need a more focused approach, follow the recommendations in Frank Harrell's course notes or book. They provide ways to deal with such situations via a combination of methods, cutting down on the number of predictors included without leading to bias in the final model. For example, multiple predictors representing essentially the same phenomenon might be combined into a single predictor.
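
As a minimal sketch of one such combination step (the DataFrame `df` and its column names below are hypothetical), several measures of the same phenomenon can be collapsed to their first principal component and entered as a single predictor:

```python
# A minimal sketch with hypothetical column names: collapse several measures
# of the same phenomenon into one score via the first principal component,
# an unsupervised data-reduction step done without looking at the outcome.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

related = df[["sbp_visit1", "sbp_visit2", "sbp_visit3"]]  # hypothetical columns
score = PCA(n_components=1).fit_transform(
    StandardScaler().fit_transform(related)
)
df["bp_score"] = score.ravel()  # one combined predictor replaces three
```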

If there is a particular variable like a treatment in which you are interested, and you wish to check its relationship with outcome while adjusting for other factors, you could consider a hybrid of the above approaches: penalizing the factors you want to adjust for with ridge while keeping the treatment variable unpenalized. This paper illustrates that approach, and discusses others.
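
A minimal sketch of that hybrid idea (not necessarily the exact estimator in the paper; the data below are simulated): minimize a ridge-penalized negative log-likelihood in which the treatment coefficient carries no penalty. In R, glmnet's `penalty.factor` argument offers the same per-coefficient control.

```python
# A minimal sketch: ridge-penalized logistic regression where the treatment
# coefficient (column 0) is left unpenalized, fit by directly minimizing the
# penalized negative log-likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n, p = 200, 10                       # simulated sample size and predictors
X = rng.normal(size=(n, p))          # column 0 = treatment, rest = adjusters
y = rng.binomial(1, expit(1.0 * X[:, 0] + 0.5 * X[:, 1]))

lam = 1.0                            # ridge penalty (tune by CV in practice)
pen = np.ones(p)
pen[0] = 0.0                         # no shrinkage on the treatment effect

def objective(params):
    intercept, beta = params[0], params[1:]
    eta = intercept + X @ beta
    nll = np.sum(np.logaddexp(0.0, eta) - y * eta)   # logistic neg. log-lik.
    return nll + lam * np.sum(pen * beta**2)          # penalize adjusters only

fit = minimize(objective, np.zeros(p + 1), method="BFGS")
print("treatment log-odds ratio:", fit.x[1])
```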


* I take that to mean single-predictor models. It's often preferred to use "univariate" and "multivariate" to refer to the number of response/outcome variables, not the number of predictors, although that preference isn't universal.
EdM
  • (+1) Steyerberg, *Clinical Prediction Models* is good too. – Scortchi - Reinstate Monica Aug 18 '20 at 16:40
  • I really appreciate this informative insight. I have come across penalized regression in my reading, but I am unfamiliar with the process. I will have to take a closer look at the book and the paper you attached. Thank you – Ala' Shaban Aug 18 '20 at 16:46

It is always a good idea to investigate the added value of clinical knowledge when building a statistical model. At the same time, much statistical analysis is performed precisely to obtain new knowledge about as-yet-unknown relations between factors in the domain under study.

My advice is to pursue a two-fold approach:

  1. Build one regression model based on the clinical knowledge available;
  2. Use a statistical package such as SPSS, SAS, or R to perform a variable search for a well-performing regression model; these packages have such algorithms built in (a sketch of one such search follows below).

After you have found a suitable empirical model via (2), it is interesting to compare it with the model from (1) that is based on existing clinical knowledge.
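
As a minimal sketch of such a search (assuming scikit-learn in Python, with hypothetical arrays `X` and `y`), a greedy sequential forward search adds one predictor at a time, keeping the subset with the best cross-validated log-loss:

```python
# A minimal sketch, assuming hypothetical X (predictors) and y (0/1 outcome):
# greedy sequential forward selection for a logistic model, scored by
# cross-validated log-loss. By default about half the features are kept;
# pass n_features_to_select to change that.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    direction="forward",          # greedy sequential forward search
    scoring="neg_log_loss",       # cross-validated deviance criterion
    cv=5,
)
selector.fit(X, y)
print(selector.get_support())     # boolean mask of the selected predictors
```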

Match Maker EE
  • Thank you, this is really useful; I will consider this approach. I am curious to know your thoughts about performing bivariate analyses and then including variables that meet an arbitrary alpha level, e.g. 0.2? – Ala' Shaban Aug 18 '20 at 08:02
  • I would recommend using a greedy sequential forward search instead. As stated above, statistical software packages include such an approach in their routines. – Match Maker EE Aug 18 '20 at 10:14
  • Thank you for the help, I believe I will consider this approach. – Ala' Shaban Aug 18 '20 at 16:34

Assuming that you are interested in estimating a causal relationship, and not just in predicting an outcome, the choice of what covariates to include should always be guided by a causal model constructed from subject-matter knowledge.

Many people find directed acyclic graphs (DAGs) a helpful tool to make the assumed model explicit and help decide which variables should go into the estimation. Generally speaking, you want to include variables that are potential confounders of the treatment you are interested in, that is, potential causes of both the treatment and outcome.

Adjusting for variables that are not confounders does not necessarily reduce bias and may make it worse, for example by conditioning on a collider or a mediator (at the very least, it will make your estimates much harder to interpret).
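
As a minimal sketch of the idea (the graph below is hypothetical, and this common-cause check is a simplification of the full back-door criterion, which tools such as the R package dagitty implement properly), candidate confounders can be flagged as variables that are causes of both the treatment and the outcome:

```python
# A minimal sketch with a hypothetical DAG: flag candidate confounders as
# nodes that are ancestors (causes) of both the treatment and the outcome.
# This is a simplification of the full back-door criterion.
import networkx as nx

dag = nx.DiGraph([
    ("age", "treatment"), ("age", "outcome"),            # common cause
    ("severity", "treatment"), ("severity", "outcome"),  # common cause
    ("treatment", "outcome"),
    ("treatment", "side_effect"),                        # not a confounder
])
assert nx.is_directed_acyclic_graph(dag)

confounders = nx.ancestors(dag, "treatment") & nx.ancestors(dag, "outcome")
print(confounders)   # {'age', 'severity'}
```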

pengzell
  • Thank you for this informative response. If I build a model guided by subject-matter knowledge, do I need to provide justification in my thesis, or is that generally considered reasonable? – Ala' Shaban Aug 18 '20 at 08:31
  • Having a theoretical justification is always good, but how much of it you should spell out will differ depending on field, audience, publication format, or even the research question. Here I would listen to your advisor or trusted peers. – pengzell Aug 18 '20 at 08:42
  • Thank you for the help. I will discuss this with the advisor. – Ala' Shaban Aug 18 '20 at 16:35