
I'm trying to fit a logistic regression model to some binary outcome data. Besides the binary outcome, I have several metrics (numeric, integer, and factor variables) associated with each case. The aim, as usual, is to find the model that best describes the data without overfitting.

I'm using R, so to try things out and get the data organized I'm using the glm function. I can fit a model using all the variables (probably not a good one), or I can choose which ones to include. But how do I determine which ones to use? I know I can compare AIC values to see whether one model is better than another, but with this many metrics there would be a lot of candidate models to try, and I don't think that is how AIC is meant to be used.
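
To make the two options concrete, here is a minimal sketch on simulated data. The data frame `d` and its columns `x1`, `x2`, `grp`, and `y` are hypothetical stand-ins, not the actual metrics, and the simulated effect sizes are arbitrary.

```r
# Simulated stand-in data: every name and effect size here is hypothetical
set.seed(1)
n <- 500
d <- data.frame(
  x1  = rnorm(n),                                            # numeric metric
  x2  = rpois(n, lambda = 3),                                # integer metric
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE))  # factor metric
)

# The outcome depends on x1 and grp only; x2 is pure noise
eta <- -0.5 + 1.2 * d$x1 + 0.8 * (d$grp == "b")
d$y <- rbinom(n, size = 1, prob = plogis(eta))

# Option 1: a model using every available variable
fit_all <- glm(y ~ ., data = d, family = binomial)

# Option 2: a model using a hand-picked subset
fit_sub <- glm(y ~ x1 + grp, data = d, family = binomial)

# AIC() accepts several fitted models at once; lower AIC is preferred
aic_tab <- AIC(fit_all, fit_sub)
print(aic_tab)
```

With only three candidate variables this comparison is trivial, but the number of possible subsets doubles with every added variable, which is the difficulty described above.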

So what is the basic approach in a situation like this? Do I run glm on one variable at a time, see whether each is significant, and choose from there, or are there more effective approaches?

Scortchi - Reinstate Monica
Denver Dang
  • This is a big topic: [feature](https://stats.stackexchange.com/questions/tagged/feature-selection?sort=votes&pageSize=50)/variable/[model](https://stats.stackexchange.com/questions/tagged/model-selection) selection. You might want to read some relevant posts & focus your question. – Scortchi - Reinstate Monica Nov 13 '18 at 16:40
  • Is there perhaps some literature/posts you could recommend? – Denver Dang Nov 13 '18 at 16:42
  • As you're concerned with generalized linear models, see [Algorithms for automatic model selection](https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection) & [Harrell (2015), *Regression Modeling Strategies*](https://www.springer.com/gb/book/9783319194240) (see also http://biostat.mc.vanderbilt.edu/wiki/Main/RmS). – Scortchi - Reinstate Monica Nov 13 '18 at 16:50
  • Note that what you propose in your last paragraph is (sometimes) called *univariable screening*, & Harrell writes that it's "even worse than stepwise modelling because it can miss important variables that are only significant after adjusting for other variables". – Scortchi - Reinstate Monica Nov 13 '18 at 17:13
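
The point in the last comment can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the predictors `x1` and `x2`, their correlation of about 0.8, and the coefficients 2 and -1.6 are all invented so that the effect of `x2` cancels out when `x1` is left out of the model. It is not taken from Harrell's book.

```r
# Hypothetical simulation: x2 matters only after adjusting for x1
set.seed(42)
n  <- 2000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + sqrt(1 - 0.8^2) * rnorm(n)   # cor(x1, x2) is about 0.8

# Coefficients chosen so that x2 has no marginal association with y:
# 2 * 0.8 - 1.6 = 0
y <- rbinom(n, size = 1, prob = plogis(2 * x1 - 1.6 * x2))

# Univariable screen: x2 on its own
fit_uni <- glm(y ~ x2, family = binomial)

# The same variable, adjusted for x1
fit_adj <- glm(y ~ x1 + x2, family = binomial)

# Wald p-values for x2 in each model
p_uni <- summary(fit_uni)$coefficients["x2", "Pr(>|z|)"]
p_adj <- summary(fit_adj)$coefficients["x2", "Pr(>|z|)"]
print(c(univariable = p_uni, adjusted = p_adj))
```

In this setup the estimated `x2` coefficient in `fit_uni` should sit near zero while the one in `fit_adj` should be close to -1.6, so a screen based on single-variable fits would tend to discard a variable that the joint model clearly needs.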
