
I have a dataset of 1931 observations and I intend to predict a binary outcome from it. There are 128 predictors (both binary and continuous). First I fitted a logistic regression model using all predictors and obtained a highly significant model (AUC = 0.84). Suspecting that the high AUC was due to overfitting from using too many predictors, I ran stepwise selection to find the effective predictors:

mylogit <- glm(outcome ~ ., data = temp, family = "binomial")
step <- step(mylogit, direction = "both")

Now I am not sure whether I should have done cross-validation before or after the stepwise modeling.

See also [here](http://stats.stackexchange.com/questions/64991/) & [here](http://stats.stackexchange.com/questions/5918/). Any kind of outcome-based model selection has to be repeated as part of each training fold to get a fair estimate of the out-of-sample performance of the whole procedure. – Scortchi - Reinstate Monica Apr 22 '15 at 08:53
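
A minimal sketch of what repeating the selection inside each fold could look like in R, assuming `temp` is the data frame from the question with a numeric 0/1 column named `outcome`; the 10-fold split and the rank-based AUC helper `auc_hat` are illustrative choices, not from the original post:

    set.seed(1)
    k <- 10
    folds <- sample(rep(1:k, length.out = nrow(temp)))

    # Rank-based (Mann-Whitney) estimate of AUC
    auc_hat <- function(y, p) {
      n1 <- sum(y == 1); n0 <- sum(y == 0)
      (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    cv_auc <- sapply(1:k, function(i) {
      train <- temp[folds != i, ]
      test  <- temp[folds == i, ]
      # Repeat the whole procedure on the training fold only
      full <- glm(outcome ~ ., data = train, family = "binomial")
      sel  <- step(full, direction = "both", trace = 0)
      # Evaluate the selected model on the held-out fold
      auc_hat(test$outcome, predict(sel, newdata = test, type = "response"))
    })

    mean(cv_auc)

The mean of `cv_auc` then estimates the out-of-sample AUC of the whole select-then-fit procedure, rather than the apparent AUC of one model already selected on the full data.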

0 Answers