
I've fitted 16 logistic regression models to my data and I'm not sure which one to choose as my final model. I looked at several criteria to help me decide:

  1. significance of the predictor variables
  2. AIC and BIC
  3. area under the ROC curve
  4. cross-validation error

The trouble is that these criteria conflict with one another: a model with a low AIC or BIC may have a smaller area under the ROC curve than a model with a higher AIC or BIC, and the model with the largest area under the ROC curve may have a higher cross-validation error than a model with a smaller area. I was able to narrow the 16 possible models down to 4-5, but I can't make the final decision based on criteria 1-4 since they point in different directions. How can I go about choosing the best logistic regression model for my data?
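For concreteness, here is a minimal sketch of how I could compute criteria 2-4 for one candidate model; X, y, and the feature sets are placeholders, not my actual data, and I'm using statsmodels for AIC/BIC and scikit-learn for the ROC area and a cross-validated log-loss:

```python
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

def score_candidate(X, y):
    """Return AIC, BIC, in-sample ROC AUC and cross-validated log-loss
    for one candidate feature matrix X (numpy array) and binary outcome y."""
    # Maximum-likelihood fit for the likelihood-based criteria (AIC, BIC).
    fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    aic, bic = fit.aic, fit.bic

    # In-sample area under the ROC curve; note this is not penalised for
    # the number of parameters in the model.
    auc = roc_auc_score(y, fit.predict(sm.add_constant(X)))

    # Cross-validated log-loss: a proper scoring rule, so it avoids any
    # arbitrary classification cut-off. C is set very large so the sklearn
    # fit is essentially unpenalised, matching the statsmodels fit above.
    cv_logloss = -cross_val_score(
        LogisticRegression(C=1e6, max_iter=1000), X, y,
        cv=5, scoring="neg_log_loss").mean()
    return aic, bic, auc, cv_logloss
```

Running this over all 16 candidate feature sets gives me the four numbers side by side, which is how I ended up with the conflicting rankings described above.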

  • (1) What's your model to be used for, and what's the goal of selection? This determines what "best" means. (2) Do the models contain different nos. of free parameters? Note that the area under the ROC curve isn't penalized for model complexity. (3) How are you measuring cross-validation error? Avoid measures based on classification using arbitrary cut-offs. (4) If you use any response-based criterion, picking the "best" from 16 without shrinkage introduces bias into the predictions & coefficient estimates - you can estimate how much by cross-validating the *whole* selection process. – Scortchi - Reinstate Monica Apr 10 '14 at 09:00
  • @Scortchi Thanks for the response. (1) I'm looking for a model with high predictive power. The model is for determining the diagnosis of an illness, so given a patient's symptoms, I want a model that can predict the diagnosis as accurately as possible. (2) What are "nos. of free parameters"? (3) I'm using K-fold cross-validation with K=2; my sample size is 284. (4) By cross-validating the whole selection process, do you mean performing cross-validation on each and every one of the 16 models? – Adrian Apr 10 '14 at 17:31
  • (2) Free parameters are the intercept & log odds ratios you estimate in the regression model: models with more of these tend to fit *the data they're fitted on* better than models with fewer (necessarily so when the simpler model's contained within the more complex), but may fit *new data* worse. (3) What I mean is: how are you measuring "error" in the test set of each CV fold? (4) I mean e.g. in each cross-validation fold pick the model with lowest AIC from the training set & score it on the test set. – Scortchi - Reinstate Monica Apr 11 '14 at 10:20
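If I understand the last comment correctly, cross-validating the whole selection process might look roughly like the sketch below. This is only an illustration under assumed inputs: `candidates` is a hypothetical dict mapping a model name to a list of column indices, and X, y stand in for my 284-observation data set.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

def cv_selection_error(X, y, candidates, n_splits=5, seed=0):
    """Cross-validate the procedure 'pick the lowest-AIC model', not just
    one fixed model, so the optimism from the selection step is included."""
    fold_losses = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        # Fit every candidate on the training part and keep the lowest-AIC one.
        best_fit, best_cols, best_aic = None, None, np.inf
        for cols in candidates.values():
            fit = sm.Logit(y[train_idx],
                           sm.add_constant(X[train_idx][:, cols])).fit(disp=0)
            if fit.aic < best_aic:
                best_fit, best_cols, best_aic = fit, cols, fit.aic
        # Score only the selected model on the held-out part, using log-loss
        # (a proper scoring rule, so no arbitrary classification cut-off).
        p = best_fit.predict(sm.add_constant(X[test_idx][:, best_cols]))
        fold_losses.append(log_loss(y[test_idx], p))
    return np.mean(fold_losses)
```

The number returned estimates how well the *selection procedure* predicts new patients, rather than how well any single pre-chosen model fits the data it was selected on.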

0 Answers