I'm currently working through a machine learning textbook and just read a bit about $k$-fold cross-validation, and I am wondering about the following. I want to estimate a tuning parameter, e.g. the penalty parameter $\lambda$ in a penalized likelihood method. To do this, I can see two different approaches:
I partition the training data into $k$ equally sized folds, and for each fold I fit the model on the other $k-1$ folds, predict $y$ on the held-out fold, and compare these predictions with the actual $y$ values in that fold. I do this for every interesting choice of $\lambda$, and choose the value with the smallest error, averaged over all folds and all observations within each fold.
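Concretely, here is a minimal sketch of what I mean, using ridge regression from scikit-learn as a stand-in penalized method (its `alpha` plays the role of $\lambda$); the dataset, the $\lambda$ grid, and $k=5$ are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Made-up data and penalty grid, just for illustration.
X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
lambdas = np.logspace(-3, 3, 13)

# k-fold CV: each fold is held out once, the model is fit on the rest.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
mean_errors = []
for lam in lambdas:
    fold_errors = []
    for train_idx, test_idx in kf.split(X):
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        fold_errors.append(np.mean((pred - y[test_idx]) ** 2))
    # Average the per-fold mean squared errors over all k folds.
    mean_errors.append(np.mean(fold_errors))

best_lambda = lambdas[int(np.argmin(mean_errors))]
```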
I randomly split the training data into 2 equally large sets, fit the model on one set, and compute the prediction error on the other set. For every interesting $\lambda$, I note this error. Then I re-sample the data into 2 (different) equally large sets and repeat the above procedure. I split $k$ times in total, and average the errors over these splits to arrive at the best parameter.
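In code, the second approach would look something like this sketch (same made-up data and grid as above; `ShuffleSplit` draws the repeated random 50/50 splits):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit

# Same made-up data and penalty grid as in the first sketch.
X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
lambdas = np.logspace(-3, 3, 13)

# k independent re-samplings into two equally large halves:
# fit on one half, score on the other, average over the k repeats.
splits = ShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
mean_errors = []
for lam in lambdas:
    errors = []
    for train_idx, test_idx in splits.split(X):
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        errors.append(np.mean((pred - y[test_idx]) ** 2))
    mean_errors.append(np.mean(errors))

best_lambda = lambdas[int(np.argmin(mean_errors))]
```

The only structural difference I can see is that in the first sketch every observation is held out exactly once, while in the second the random halves can overlap across repeats, so some observations may be held out several times and others never.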
The second approach looks rather naive, and I am wondering if there is something wrong with it. Are there reasons, generally speaking, why one would prefer method 1 over method 2? Are there computational reasons, or even statistical ones?