I'm interested in building a set of candidate models in R for an analysis using logistic regression. Once I build the set of candidate models, I evaluate their fit to the data using AICc (aicc <- dredge(results, evaluate = TRUE, rank = "AICc") from the MuMIn package).
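For context, this is roughly how I am setting that up (a minimal sketch; the model formula, the response presence, and the data frame mydata are placeholders for illustration):

    library(MuMIn)

    # dredge() requires the global model to fail on missing values rather than drop them
    options(na.action = "na.fail")

    # Global (most complex) logistic regression model
    results <- glm(presence ~ elev + slope + cover,
                   family = binomial, data = mydata)

    # All-subsets model selection, ranked by AICc
    aicc <- dredge(results, evaluate = TRUE, rank = "AICc")
    head(aicc)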
I would then like to use k-fold cross-validation to evaluate the predictive performance of the final model chosen from the analysis. I have a few questions about k-fold cross-validation:
I assume you use your entire data set to initially build your candidate set of models. For example, if I have 20,000 observations, wouldn't I first build the candidate set using all 20,000, and then use AICc to rank the models and select the most parsimonious one?
After you select the final model (or model-averaged model), would you then conduct k-fold cross-validation to evaluate its predictive performance?
What is the easiest way to code a k-fold cross-validation in R?
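The simplest thing I have come across so far is cv.glm() from the boot package, though I am not sure it is the recommended route. A minimal sketch of what I am picturing, assuming the final model is a glm() fit and reusing the placeholder names from above:

    library(boot)

    # Final logistic regression model (placeholder formula and data)
    final.model <- glm(presence ~ elev + slope, family = binomial, data = mydata)

    # Misclassification-rate cost function for a binary response
    cost <- function(r, pi) mean(abs(r - pi) > 0.5)

    # 10-fold cross-validation; delta[1] is the raw CV estimate of prediction error
    cv.err <- cv.glm(mydata, final.model, cost = cost, K = 10)
    cv.err$delta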
Does the k-fold cross-validation code automatically split the entire data set (e.g., all 20,000 observations) into training and validation sets, or do you have to subset the data manually?
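To illustrate what I mean by subsetting manually, this is the kind of hand-rolled loop I would otherwise expect to write (again using the placeholder data frame mydata and a placeholder formula):

    set.seed(123)
    k <- 10

    # Randomly assign each row to one of k folds
    folds <- sample(rep(1:k, length.out = nrow(mydata)))

    cv.error <- numeric(k)
    for (i in 1:k) {
      train <- mydata[folds != i, ]  # training set
      test  <- mydata[folds == i, ]  # held-out validation set

      fit  <- glm(presence ~ elev + slope, family = binomial, data = train)
      prob <- predict(fit, newdata = test, type = "response")

      # Misclassification rate on the held-out fold
      cv.error[i] <- mean((prob > 0.5) != test$presence)
    }
    mean(cv.error)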