
In leave-one-out cross-validation (LOOCV), each training set looks very similar to the others, differing by only one observation. To estimate the test error, you take the average of the errors over the folds. That average has high variance.

Is there a mathematical formula, visual, or intuitive way to understand why that average has higher variance than in $k$-fold cross-validation?

amoeba
xyzzy

5 Answers


The original version of this answer was missing the point (that's when the answer got a couple of downvotes). The answer was fixed in October 2015.

This is a somewhat controversial topic.

It is often claimed that LOOCV has higher variance than $k$-fold CV, and that this is because the training sets in LOOCV have more overlap. This, the reasoning goes, makes the estimates from different folds more dependent than in $k$-fold CV and hence increases the overall variance. See, for example, this quote from The Elements of Statistical Learning by Hastie et al. (Section 7.10.1):

What value should we choose for $K$? With $K = N$, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the $N$ "training sets" are so similar to one another.

See also a similar quote in the answer by @BrashEquilibrium (+1). The accepted and most upvoted answers in "Variance and bias in cross-validation: why does leave-one-out CV have higher variance?" give the same reasoning.

HOWEVER, note that Hastie et al. do not give any citations, and while this reasoning does sound plausible, I would like to see some direct evidence that this is indeed the case. One reference that is sometimes cited is Kohavi 1995, but I don't find it very convincing on this particular claim.

MOREOVER, here are two simulations showing that LOOCV has either the same or even a slightly lower variance than 10-fold CV.
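In the same spirit, here is a minimal sketch of how such an experiment can be set up (this is not the original simulations; the synthetic dataset from `make_classification`, the logistic-regression model, and the repetition counts are illustrative assumptions):

```python
# Sketch: compare the variability of the averaged LOOCV and 10-fold CV
# estimates across many freshly drawn datasets (NOT the original simulations).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

n_samples, n_repeats = 50, 200
loo_est, kfold_est = [], []

for rep in range(n_repeats):
    # A fresh synthetic dataset per repetition (illustrative choice).
    X, y = make_classification(n_samples=n_samples, n_features=5, random_state=rep)
    clf = LogisticRegression(max_iter=1000)
    loo_est.append(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
    kfold_est.append(
        cross_val_score(clf, X, y, cv=KFold(10, shuffle=True, random_state=rep)).mean()
    )

# The quantity under debate: variance of the averaged CV estimate across datasets.
print("LOOCV   estimate variance:", np.var(loo_est))
print("10-fold estimate variance:", np.var(kfold_est))
```

Results will depend on the model, the sample size, and the data-generating process; the point is only to show how the comparison can be made concrete.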

amoeba

  • Could you give the intuition for regression too? – xyzzy Mar 21 '14 at 18:06
  • @xyzzy: Can you elaborate on what regression problem you have in mind and what exactly cross-validation has to do with it? In case I misunderstood your question, maybe you can update it. I thought you were talking about a classification problem because you mentioned "error rate". – amoeba Mar 21 '14 at 20:09
  • This is a question about basic intuition. I think you can do cross-validation on any regression problem, though; you still get prediction errors, which you can measure as mean squared error. Perhaps it is best intuited as the classification problem, and the regression problem would be similar in that the variance for k-fold would be N/k times smaller? – xyzzy Mar 21 '14 at 23:20
  • @xyzzy: Yes, I think the same intuition applies. In k-fold CV you are considering N/k samples in each test set instead of only 1, so averaging your prediction errors over these N/k samples (to get the mean prediction error in each fold) will reduce the variance (of these mean prediction errors) by a factor of N/k. The crucial point here is again that the more samples you have in your test set, the more precise an estimate of prediction/classification error you get in each fold, and the more precise the estimates are, the smaller their variance. – amoeba Mar 22 '14 at 00:13
  • But the average over all $k$ and $n$ folds, respectively, averages the same number of cases... – cbeleites unhappy with SX Mar 22 '14 at 04:27
  • @cbeleites: Yes, certainly. I understood the question as asking about the variance *over folds*, not *over repetitions*. Maybe OP could clarify what he or she meant. – amoeba Mar 22 '14 at 12:59
  • I believe I meant higher variance in the mean estimate over all the folds (for LOOCV vs. k-fold). I am trying to understand a comment I heard that LOOCV has a higher variance in the mean error because the training sets are all highly correlated. I am not sure how to intuit why this is so. I wonder if it's a combination of the small test-set size (1), which amoeba and cbeleites alluded to, plus something to do with the correlation among all the training sets, but I am still not able to intuit that. Hope that is clear. – xyzzy Mar 22 '14 at 22:04
  • This answer shows that the variance of *a single estimate* is higher for LOO than for k-fold. But if I'm not mistaken, in practice the final estimate is taken to be the average of the estimates across all k folds (with k=n in the case of LOO). So the relevant variance is the variance of the *mean* of the k estimates, right? In which case, for your example of LOO vs. 10-fold, both variance expressions reduce to $p(1-p)/N$ and thus are equal. This would also agree with Corollary 2 here: http://ai.stanford.edu/~ronnyk/accEst.pdf. Care to comment on this? Have I misunderstood something? – Jake Westfall Jul 20 '15 at 20:44
  • I see now that my comment ignored the covariance between the estimates being averaged. But in any case, it is the variance of the means that is of interest, yes? – Jake Westfall Jul 20 '15 at 22:09
  • @Jake, you are right, my answer (from over a year ago) does not make a lot of sense; I had already noticed it myself but forgot to deal with it. Funny that it got 12 upvotes :-/ I will update it when I have some time, but I actually don't have a very good understanding of the matter. I know that people say that the high variance of LOOCV is due to the training sets being almost the same (see the quote from Brash's answer, +1), and it sort of makes sense, but this whole issue is not completely clear to me. – amoeba Jul 21 '15 at 09:04
  • @amoeba I've been looking into this and have found many conflicting statements from various sources about whether it's true. Most sources just have a stock statement about the estimates being correlated and then maybe cite ESL. At least one says it doesn't matter (see the previous citation). Other sources explicitly say the opposite (e.g., p. 60 here: http://projecteuclid.org/euclid.ssu/1268143839). I ran a little simulation comparing numbers of folds $k = 2, 5, 10, n$, which suggests that, at least for multiple regression, variance is smallest for $k=n$. Considering writing an answer with my findings. – Jake Westfall Jul 21 '15 at 16:31
  • That's interesting, @Jake. I am currently traveling and have little time to work on this. But definitely consider posting an answer. Note that there are two older threads that are very much related to this one; perhaps this one should even be closed as a duplicate, but also maybe not. Here are the threads: [Number of folds for K-fold](http://stats.stackexchange.com/questions/61546) and [Model variance and bias in cross validation](http://stats.stackexchange.com/questions/61783). – amoeba Jul 21 '15 at 20:33
  • @Jake, I have now fixed my answer (prompted by a couple of downvotes), but have also voted to close this question as a duplicate of another one. I still think, though, that this topic requires a more thoughtful/elaborate answer than all of the existing ones. I am wondering if you have experimented further with this issue. – amoeba Oct 19 '15 at 23:02
  • @amoeba I finally got around to posting a question about this issue; you may want to check it out: https://stats.stackexchange.com/q/280665/5829 – Jake Westfall May 20 '17 at 01:15
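To make the arithmetic in the comments above explicit (a sketch, treating the test cases within a fold as i.i.d. with true error rate $p$ and, crucially, ignoring covariances between folds): each of the $k$ folds tests $m = N/k$ cases, so each fold-wise error proportion $\hat{p}_j$ has variance $p(1-p)/m$. Averaging over the folds then gives

$$\operatorname{Var}\left(\frac{1}{k}\sum_{j=1}^{k}\hat{p}_j\right)=\frac{1}{k}\cdot\frac{p(1-p)}{N/k}=\frac{p(1-p)}{N},$$

which is the same for every $k$, including $k=n$ (LOOCV). Any difference between the schemes must therefore come from the covariance terms this calculation drops.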

From An Introduction to Statistical Learning

When we perform LOOCV, we are in effect averaging the outputs of $n$ fitted models, each of which is trained on an almost identical set of observations; therefore, these outputs are highly (positively) correlated with each other. In contrast, when we perform $k$-fold CV with $k<n$, we are averaging the outputs of $k$ fitted models that are somewhat less correlated with each other, since the overlap between the training sets in each model is smaller. Since the mean of many highly correlated quantities has higher variance than does the mean of many quantities that are not as highly correlated, the test error estimate resulting from LOOCV tends to have higher variance than does the test error estimate resulting from $k$-fold CV.

To summarize, there is a bias-variance trade-off associated with the choice of $k$ in $k$-fold cross-validation. Typically, given these considerations, one performs $k$-fold cross-validation with $k=5$ or $k=10$, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
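The quoted claim rests on a standard identity (a sketch, not part of the quote): if the $k$ fold estimates $e_1, \dots, e_k$ have common variance $\sigma^2$ and average pairwise correlation $\bar{\rho}$, then

$$\operatorname{Var}\left(\frac{1}{k}\sum_{j=1}^{k} e_j\right) = \frac{\sigma^2}{k} + \frac{k-1}{k}\,\bar{\rho}\,\sigma^2,$$

so for fixed $\sigma^2$, stronger correlation between folds (as with LOOCV's heavily overlapping training sets) leaves more of the variance in place after averaging.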

Brash Equilibrium
  • In simple cases I think the answer is: the grand mean (over all test cases and all folds) has the same variance for $k$-fold and LOO validation.

  • Simple means here: models are stable, so each of the $k$ or $n$ surrogate models yields the same prediction for the same sample (thought experiment: test the surrogate models with a large independent test set).

  • If the models are not stable, the situation gets more complex: each of the surrogate models has its own performance, so you have additional variance. In that case, all bets are off as to whether LOO or $k$-fold has more additional variance*. But you can iterate the $k$-fold CV, and taking the grand mean over all test cases and all $i \times k$ surrogate models mitigates that additional variance (see the sketch after this list). There is no such possibility for LOO: the $n$ surrogate models are all the possible surrogate models.

  • The large variance is usually due to two factors:

    • small sample size (if you weren't in a small sample size situation, you'd not be worried about variance ;-) ).
    • High-variance type of error measure. All proportion-of-test-cases types of classification error are subject to high variance; this is a basic property of estimating fractions by counting cases. Regression-type errors such as MSE behave much more benignly in this respect.
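The iterated $k$-fold idea in the list above can be sketched in code (a minimal illustration assuming scikit-learn; the synthetic dataset, logistic-regression model, and the choices $i = 20$, $k = 10$ are hypothetical, not from the original answer):

```python
# Sketch: iterating k-fold CV and taking the grand mean over all i x k
# surrogate models to average out the model-instability component of variance.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# i = 20 iterations of k = 10 folds -> 200 surrogate models.
cv = RepeatedKFold(n_splits=10, n_repeats=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# LOOCV has no analogous trick: its n surrogate models are already
# all the possible leave-one-out models.
print("grand mean accuracy over all i*k surrogate models:", scores.mean())
```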

For classification errors, there are a number of papers that look at the properties of different resampling validation schemes, in which you also see the variances.

(I guess similar papers may exist for regression errors as well, but I'm not aware of them)

* One may expect LOO to have less variance because the surrogate models are trained with more cases, but at least for certain types of classification models, LOO does not behave very well.

cbeleites unhappy with SX

In LOOCV there are no folds in the sense of $k$-fold cross-validation (the iterations can be called folds, but the term is not very meaningful here). What LOOCV does is leave one instance of the whole dataset out as test data and use all the other instances for training. So in any particular iteration of the evaluation there is only one instance in the test data and the rest are in the training data; that is why all the training sets look nearly identical.

In $k$-fold cross-validation, by using stratification (a method for balancing the dataset that ensures each class is represented in approximately equal proportion in all the samples), we can reduce the variance of the estimates.

Since LOOCV uses only one instance for testing, it cannot apply stratification. So LOOCV has a higher variance in its error estimates than $k$-fold cross-validation.
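For what stratification does in practice, here is a minimal sketch (assuming scikit-learn; the imbalanced toy labels are a hypothetical illustration of stratification itself, independent of whether it explains the variance difference):

```python
# Sketch: stratified vs. plain k-fold splits on an imbalanced dataset.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0] * 40 + [1] * 10)   # imbalanced labels (80% / 20%)
X = np.arange(50).reshape(-1, 1)

for name, cv in [("KFold", KFold(5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(5, shuffle=True, random_state=0))]:
    props = [y[test].mean() for _, test in cv.split(X, y)]
    # StratifiedKFold keeps the class proportion near 0.2 in every test fold;
    # plain KFold lets it fluctuate, adding variance to fold-wise estimates.
    print(name, np.round(props, 2))
```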

  • -1. I don't see how stratification is relevant here. Do you have any references that support your point of view? – amoeba Jul 17 '18 at 21:04

It's like taking a test with just one question - it's a lot more hit-and-miss.

This is the intuition behind the standard deviation of a single instance versus that of a mean: the score on a batch of instances has less variance.

Here are some more details.
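In symbols (the standard variance-of-the-mean result; a sketch assuming $n$ independent question scores with standard deviation $\sigma$):

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}},$$

so a one-question test ($n = 1$) gives the noisiest possible score, just as a one-observation test fold gives the noisiest possible error estimate.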

danuker
  • And why is that? Can you expand a little more on this? Right now this is more of a comment than an answer. – Andy Jun 19 '15 at 11:08
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient [reputation](http://stats.stackexchange.com/help/whats-reputation) you will be able to [comment on any post](http://stats.stackexchange.com/help/privileges/comment). – Christoph Hanck Jun 19 '15 at 11:16
  • When you take a test with more questions, the score is averaged. And the variance of a mean is less than the variance on a single question, see more details here: [Standard deviation of the mean](https://en.wikipedia.org/wiki/Standard_deviation#Standard_deviation_of_the_mean). – danuker Jun 19 '15 at 12:06
  • @ChristophHanck It's an intuitive explanation, though it's not a complete answer. – danuker Jun 19 '15 at 12:07
  • That's why I suggested to post it as a comment instead. – Christoph Hanck Jun 19 '15 at 12:09