
I've fitted 16 logistic regression models to my data and I'm not sure which one to choose as my final model. I looked at several criteria to help me decide:

  1. significance of the predictor variables
  2. AIC and BIC
  3. area under the ROC curve
  4. cross-validation error

The trouble is that these criteria conflict with one another: a model with a low AIC or BIC may have a smaller area under the ROC curve than a model with a higher AIC or BIC, and the model with the largest area under the ROC curve may have a higher cross-validation error than a model with a smaller area. I was able to narrow the 16 possible models down to 4-5, but I can't make the final decision based on criteria 1-4 since they point in different directions. How can I go about choosing the best logistic regression model for my data?
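For concreteness, here is a minimal sketch of how I could compute criteria 2-4 for one candidate model; X, y, and the feature sets are placeholders, not my actual data, and I'm using statsmodels for AIC/BIC and scikit-learn for the ROC area and a cross-validated log-loss:

```python
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

def score_candidate(X, y):
    """Return AIC, BIC, in-sample ROC AUC and cross-validated log-loss
    for one candidate feature matrix X (numpy array) and binary outcome y."""
    # Maximum-likelihood fit for the likelihood-based criteria (AIC, BIC).
    fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    aic, bic = fit.aic, fit.bic

    # In-sample area under the ROC curve; note this is not penalised for
    # the number of parameters in the model.
    auc = roc_auc_score(y, fit.predict(sm.add_constant(X)))

    # Cross-validated log-loss: a proper scoring rule, so it avoids any
    # arbitrary classification cut-off. C is set very large so the sklearn
    # fit is essentially unpenalised, matching the statsmodels fit above.
    cv_logloss = -cross_val_score(
        LogisticRegression(C=1e6, max_iter=1000), X, y,
        cv=5, scoring="neg_log_loss").mean()
    return aic, bic, auc, cv_logloss
```

Running this over all 16 candidate feature sets gives me the four numbers side by side, which is how I ended up with the conflicting rankings described above.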

  • (1) What's your model to be used for, and what's the goal of selection? This determines what "best" means. (2) Do the models contain different nos. of free parameters? Note that the area under the ROC curve isn't penalized for model complexity. (3) How are you measuring cross-validation error? Avoid measures based on classification using arbitrary cut-offs. (4) If you use any response-based criterion, picking the "best" from 16 without shrinkage introduces bias into the predictions & coefficient estimates - you can estimate how much by cross-validating the *whole* selection process. – Scortchi - Reinstate Monica Apr 10 '14 at 09:00
  • @Scortchi Thanks for the response. (1) I'm looking for a model with high predictive power. The model is for determining the diagnosis of an illness, so given a patient's symptoms, I want a model that can predict the diagnosis as accurately as possible. (2) What are "nos. of free parameters"? (3) I'm using K-fold cross-validation with K=2; my sample size is 284. (4) By cross-validating the whole selection process, do you mean performing cross-validation on each and every one of the 16 models? – Adrian Apr 10 '14 at 17:31
  • (2) Free parameters are the intercept & log odds ratios you estimate in the regression model: models with more of these tend to fit *the data they're fitted on* better than models with fewer (necessarily so when the simpler model's contained within the more complex), but may fit *new data* worse. (3) What I mean is: how are you measuring "error" in the test set of each CV fold? (4) I mean e.g. in each cross-validation fold pick the model with lowest AIC from the training set & score it on the test set. – Scortchi - Reinstate Monica Apr 11 '14 at 10:20
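If I understand the last comment correctly, cross-validating the whole selection process might look roughly like the sketch below. This is only an illustration under assumed inputs: `candidates` is a hypothetical dict mapping a model name to a list of column indices, and X, y stand in for my 284-observation data set.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

def cv_selection_error(X, y, candidates, n_splits=5, seed=0):
    """Cross-validate the procedure 'pick the lowest-AIC model', not just
    one fixed model, so the optimism from the selection step is included."""
    fold_losses = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        # Fit every candidate on the training part and keep the lowest-AIC one.
        best_fit, best_cols, best_aic = None, None, np.inf
        for cols in candidates.values():
            fit = sm.Logit(y[train_idx],
                           sm.add_constant(X[train_idx][:, cols])).fit(disp=0)
            if fit.aic < best_aic:
                best_fit, best_cols, best_aic = fit, cols, fit.aic
        # Score only the selected model on the held-out part, using log-loss
        # (a proper scoring rule, so no arbitrary classification cut-off).
        p = best_fit.predict(sm.add_constant(X[test_idx][:, best_cols]))
        fold_losses.append(log_loss(y[test_idx], p))
    return np.mean(fold_losses)
```

The number returned estimates how well the *selection procedure* predicts new patients, rather than how well any single pre-chosen model fits the data it was selected on.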

0 Answers