
Looks like the rationale behind the accepted answer of this post is incorrect.

Under leave-one-out cross-validation (LOOCV), the variance of its MSE estimator is $$\mathrm{Var}\left[\frac{\sum_i x_i}{n}\right] = \frac{\mathrm{Var}\left[\sum_i x_i\right]}{n^2},$$ where $x_i$ is an estimate of MSE from one particular iteration.

I agree that LOOCV has a higher numerator, because of the covariance terms in $$\mathrm{Var}\Big[\sum_i x_i\Big] = \sum_i \mathrm{Var}[x_i] + \sum_{i \neq j} \mathrm{Cov}[x_i, x_j],$$ but the denominator is larger as well, because there are essentially $n$ estimates (more than the $k$ estimates in the $k$-fold case).

Given this, why does LOOCV still have higher variance in estimating MSE and why does it have lower bias?

(This is against intuition, because increasing the sample size should decrease variance and leave bias unchanged for $\hat\theta$ and $\hat{y}$.)

  • $k$-fold CV with any value of $k$ produces an error for each of the $n$ observations. So MSE estimate *always* has the denominator $n$. This denominator does not change between LOOCV and e.g. 10-fold CV. This is your main confusion here. – amoeba Jul 23 '18 at 08:28
  • It has been a while, but let's talk it through. So for $k$-fold cross-validation, we have one estimate of MSE in each fold, so we have $k$ estimates. The aggregate estimate we have is $\sum_i \widehat{MSE}_i/k$. If there are $n$ folds, then $n$ is the denominator. – denizen of the north Jul 23 '18 at 13:32
  • What you are missing is that the estimate of MSE in each fold is *itself* an average over $n/k$ estimates. So you have an average of $k$ numbers that are averages of $n/k$ numbers. That's the same as an average of $n$ numbers (see the small numeric check after these comments). Does that make sense? – amoeba Jul 23 '18 at 13:58
  • @amoeba I see your point. It can probably be used as a heuristic, but it is definitely not correct in a rigorous sense. The mean MSE is different for each fold. What you are proposing is essentially a global MSE estimate using the global mean. The whole point of $k$-fold MSE estimation is to have $k$ estimates rather than one global estimate. – denizen of the north Jul 23 '18 at 14:05
  • No. I strongly disagree with this assessment. – amoeba Jul 23 '18 at 14:09
  • @amoeba I have edited my answer: *Varying K does not have a direct, algebraically straightforward impact on the variance of the CV estimator* and *These two effects are influenced by the value of K which explains why different datasets and models will lead to different behaviours* - This question and answer clarifies a lot of confusion on the topic I think and should be linked to more often – Xavier Bourret Sicotte Sep 04 '18 at 10:12
  • @XavierBourretSicotte thanks for pointing it out. That is exactly the point I am trying to make. If it is the summation sign or the formula for the variance of a sum of random variables that got people caught up, there is really no point in pursuing the argument... – denizen of the north Sep 04 '18 at 21:15
  • I don't understand why you and amoeba don't agree - there is no $K$ in the formula, so how could $K$ have an impact on the bias or variance? The impact comes from the correlation / similarity / stability of the training sets and the algorithm. – Xavier Bourret Sicotte Sep 04 '18 at 22:18
  • @XavierBourretSicotte Your answer, with which I agree, has my +1. Frankly, I am puzzled why DenizenOfTheNorth does not agree with us. The premise of this question ("the denominator is larger as well because there are essentially n estimates (greater than k estimates as in the k-fold case)") is simply a misunderstanding. – amoeba Sep 07 '18 at 07:05
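
To make the point in these comments concrete, here is a small numeric check (my own sketch, with arbitrary stand-in numbers): with equal-size folds, averaging the $k$ per-fold MSEs gives exactly the same number as averaging all $n$ squared errors, so the effective denominator is $n$ for any $k$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 10
sq_errors = rng.exponential(size=n)   # stand-in for the n per-case squared errors
folds = sq_errors.reshape(k, n // k)  # k equal-size folds of n/k errors each

per_fold_mse = folds.mean(axis=1)     # the k per-fold MSE estimates
print(per_fold_mse.mean())            # average of the k fold averages
print(sq_errors.mean())               # average over all n errors -- the same number
```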

2 Answers


There has been much debate, confusion, and contradiction on this topic, both on stats.stackexchange and in the scientific literature.

A useful paper is the 2004 study by Bengio & Grandvalet, which argues that the variance of the cross-validation estimator is a linear combination of three moments:

$$\mathrm{Var} = \frac{1}{n^2} \sum_{i,j} \mathrm{Cov}(e_i, e_j) = \frac{1}{n}\sigma^2 + \frac{m-1}{n}\omega + \frac{n-m}{n}\gamma$$

where each term is a particular component of the $n \times n$ covariance matrix $\Sigma$ of cross-validation errors $\mathbf{e} = (e_1, \dots, e_n)^T$: $\sigma^2$ is the variance of a single error $e_i$, $\omega$ is the covariance between two errors in the same test block, $\gamma$ is the covariance between errors in different blocks, and $m = n/K$ is the block size.

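A minimal numeric check of this decomposition (my own sketch, with hypothetical values for the three moments): build the block-structured $\Sigma$ and verify that summing its entries reproduces the linear combination above.

```python
import numpy as np

# Hypothetical moment values and fold layout -- not taken from the paper
n, K = 12, 4                          # n errors split into K equal test blocks
m = n // K                            # block size
sigma2, omega, gamma = 1.0, 0.5, 0.2

# Build Sigma: sigma2 on the diagonal, omega within a block, gamma across blocks
Sigma = np.full((n, n), gamma)
for b in range(K):
    block = slice(b * m, (b + 1) * m)
    Sigma[block, block] = omega
np.fill_diagonal(Sigma, sigma2)

var_direct = Sigma.sum() / n**2       # (1/n^2) * sum_ij Cov(e_i, e_j)
var_formula = sigma2 / n + (m - 1) / n * omega + (n - m) / n * gamma
print(var_direct, var_formula)        # identical
```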

As @amoeba points out in a comment above, this variance is not a straightforward function of $K$. Each data point $x_i$ contributes an error term $e_i$, and these errors are averaged into the MSE. Varying $K$ does not have a direct, algebraically straightforward impact on the variance of the CV estimator.

To repeat that comment: "$k$-fold CV with any value of $k$ produces an error for each of the $n$ observations, so the MSE estimate *always* has the denominator $n$. This denominator does not change between LOOCV and, e.g., 10-fold CV. This is your main confusion here."

Now, there is more subtlety in this variance equation than meets the eye. In particular, the terms $\omega$ and $\gamma$ are influenced by the correlation between the data sets, training sets, and testing sets, and by the instability of the model. These two effects are in turn influenced by the value of $K$, which explains why different datasets and models lead to different behaviours.
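
To see these effects empirically, one can estimate the variance of the CV estimate itself by repeating CV over many simulated datasets. A sketch under assumed settings (linear model, Gaussian noise; all parameters are illustrative, not from the paper):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
n, n_datasets = 40, 200

def cv_mse(cv, X, y):
    # cross_val_score returns negated MSE, so flip the sign back
    scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

loo, tenfold = [], []
for _ in range(n_datasets):
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
    loo.append(cv_mse(LeaveOneOut(), X, y))
    tenfold.append(cv_mse(KFold(n_splits=10, shuffle=True, random_state=0), X, y))

print("variance of the LOOCV estimate  :", np.var(loo))
print("variance of the 10-fold estimate:", np.var(tenfold))
```

Which estimator comes out more variable depends on the stability of the model under this data-generating process, which is exactly the point: the effect of $K$ is indirect and dataset-dependent.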

You will need to read through the extensive (and technical) literature to really grasp the subtlety and special cases.

Xavier Bourret Sicotte

In addition to Xavier's answer:

"Why does LOOCV still have higher variance in estimating MSE and why does it have lower bias?"

  • Do we know whether the situations where high variance for LOO is observed are the same ones in which it also has low bias?

  • I'm aware of one particular situation (LOO, small sample size, classifiers that take into account the relative frequency of classes in training) which causes a high pessimistic bias. For widely used figures of merit such as % misclassification, accuracy, or similar fractions of tested cases, this large pessimistic bias often leads to larger variance, because for these figures of merit the variance is mathematically tied to their value, with the highest variance at 50 % (see the short illustration after this list).

  • The "low [pessimistic] bias of LOO" is usually argued from the learning curve: if, instead of taking away $m$ training cases, only 1 training case is left out, the surrogate models are on average less worse than those where $m$ cases were removed. Implied: All other things being equal (which is not the case in the peculair situation above).
    One would also expect the instability (B&G's ω variance component) to be lower for LOO as more training cases are available. Unfortunately, for LOO this cannot be measured.
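
As a quick illustration of the point about proportions above (my own sketch): for $n_{test}$ tested cases, $\mathrm{Var}(\hat{p}) = p(1-p)/n_{test}$, which grows as the observed rate $p$ moves toward 50 %, so a pessimistic bias that pushes the error rate toward 50 % also inflates the variance.

```python
# Variance of an estimated proportion p_hat over n_test tested cases
n_test = 100
for p in [0.05, 0.25, 0.50, 0.75, 0.95]:
    print(f"p = {p:.2f}  Var(p_hat) = {p * (1 - p) / n_test:.5f}")
```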

cbeleites unhappy with SX