*An Introduction to Statistical Learning* puts the rationale pretty succinctly on page 214:
> The rationale here is that if a set of models appear to be more or less equally good, then we might as well choose the simplest model—that is, the model with the smallest number of predictors.
Note that the one-standard-error rule is recommended in cases where the relation of cross-validated error to the number of predictors is "quite flat." The rule also takes the following into account:
> Furthermore, if we repeated the validation set approach using a different split of the data into a training set and a validation set, or if we repeated cross-validation using a different set of cross-validation folds, then the precise model with the lowest estimated test error would surely change.
So why not choose the simplest useful model in such a case? The one-standard-error rule is a rule of thumb, a way to get a reasonably simple model. I am not aware of any deeper justification for this choice from first principles.
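To make the rule of thumb concrete, here is a minimal sketch (with made-up, hypothetical cross-validation numbers, not figures from the book): among all model sizes whose mean CV error is within one standard error of the minimum, pick the smallest.

```python
import numpy as np

def one_se_rule(n_predictors, cv_mean, cv_se):
    """Return the number of predictors chosen by the one-standard-error rule."""
    n_predictors = np.asarray(n_predictors)
    cv_mean = np.asarray(cv_mean, dtype=float)
    cv_se = np.asarray(cv_se, dtype=float)
    best = np.argmin(cv_mean)                # model with the lowest CV error
    threshold = cv_mean[best] + cv_se[best]  # one standard error above the minimum
    candidates = n_predictors[cv_mean <= threshold]
    return candidates.min()                  # simplest model within one SE

# A hypothetical "quite flat" CV curve over model sizes 1..6:
sizes = [1, 2, 3, 4, 5, 6]
means = [10.0, 7.0, 6.1, 6.0, 6.05, 6.2]
ses   = [0.5, 0.4, 0.3, 0.3, 0.3, 0.3]
print(one_se_rule(sizes, means, ses))  # picks 3, not the minimizer 4
```

With these numbers the raw minimum sits at 4 predictors (error 6.0), but 3 predictors (error 6.1) falls inside the one-SE band 6.0 ± 0.3, so the rule selects the simpler model. Note also that because a different fold split would perturb all of these means, the minimizer itself is not stable, which is exactly the quoted point.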
This Cross Validated question goes into a good deal more detail on this issue.