The answer to both questions is yes:
- yes, LOO does have a pessimistic bias, and
- yes, the described effect of additional pessimistic bias is well known.
Richard Hardy's answer gives a good explanation of the well-known slight pessimistic bias of a correctly performed resampling validation (including all flavors of cross validation).
However, the mechanism discussed in the body of the question is a different one: removing a case that is in some sense extreme yields a test/training split where the training subset is particularly unrepresentative of the case to be tested. This can cause additional error, as Sammy already explained. The reason for this high error is that predictive performance deteriorates very quickly for cases just outside (or at the edge of) the training space.
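To make this concrete, here is a minimal simulation sketch (my own illustration, not taken from the question; it assumes a simple 1-D linear model and uses scikit-learn's `LeaveOneOut`). Cases far from the center of the training space tend to receive the largest held-out errors, because removing them pulls the training set away from them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 1))
y = 2 * x[:, 0] + rng.normal(scale=0.5, size=30)

# Collect the squared LOO error of each case.
loo_err = np.empty(len(x))
for train, test in LeaveOneOut().split(x):
    model = LinearRegression().fit(x[train], y[train])
    loo_err[test] = (model.predict(x[test]) - y[test]) ** 2

# Compare the errors of the "inner" and "outer" halves of the sample.
order = np.argsort(np.abs(x[:, 0] - x[:, 0].mean()))
print("mean LOO squared error, inner half:", loo_err[order[:15]].mean())
print("mean LOO squared error, outer half:", loo_err[order[15:]].mean())
```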
What can be done about this effect?
There are different points of view on such a situation; which one applies, and what to do about it, depends on your judgment of the task at hand.
- On the one hand, this may be seen as an indication of the error to be expected for similarly extreme application cases (somewhat outside the training space): encountering such cases during resampling suggests that the model built on the whole data set will also encounter similarly extreme cases during production use.
From this point of view, the additional error is not a bias but part of an evaluation that includes slight extrapolation outside the training space, which is judged representative of production use.
- On the other hand, it is perfectly valid to set up a model under the additional constraint/requirement/assumption that no prediction should be done outside the training space. Such a model should ideally refuse to predict cases outside its training domain. The LOO error over the cases such a model does predict would not be worse, but the validation would encounter a lot of rejects.
Now, one can argue that the mechanism of leave one out produces an unrepresentatively high proportion of outside-training-space cases due to the described opposite influence on training and test subset populations. This can be shown by studying the bias and variance properties for various $n$ or $k$ in leave-$n$-out and $k$-fold cross validation, respectively. Doing this, there are situations (data set + model combinations) where leave one out exhibits a larger pessimistic bias than would be expected from leave-more-than-one-out (see the Kohavi paper linked by Sammy; other papers report such behaviour as well). A sketch of how such a comparison can be set up follows below.
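Here is a minimal sketch of such a comparison (my own, and hedged: the classifier, data generator, and sample size below are arbitrary assumptions, and whether LOO comes out more pessimistic depends on the data set + model combination). The idea is to approximate the "true" performance of the model fit on the small sample by a large hold-out set, and compare the LOO and $k$-fold estimates against it:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, LeaveOneOut, KFold

X_all, y_all = make_classification(n_samples=2000, n_features=5, random_state=0)
X, y = X_all[:30], y_all[:30]            # small training sample
X_hold, y_hold = X_all[30:], y_all[30:]  # large hold-out approximating the "true" performance

model = KNeighborsClassifier(n_neighbors=3)
ref_acc = model.fit(X, y).score(X_hold, y_hold)
loo_acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
kfold_acc = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0)).mean()

print(f"hold-out reference accuracy: {ref_acc:.3f}")
print(f"LOO estimate               : {loo_acc:.3f}")
print(f"5-fold estimate            : {kfold_acc:.3f}")
```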
I may add that, as leave-one-out has other undesirable properties (conflating model stability with respect to the training cases with the random error of the tested cases), I'd in any case recommend against using LOO whenever feasible.
Stratified variants of resampling validation by design produce more closely matching training and test subpopulations; they are available for classification as well as regression.
Whether it is appropriate or not to employ such a stratification is basically a matter of judgment about the task at hand.
However, leave one out differs from other resampling validation schemes in that it does not allow stratification. So if stratification is called for, leave one out is not an appropriate validation scheme; a sketch of stratified splitting follows below.
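A minimal sketch of stratified splitting (assumptions: scikit-learn's `StratifiedKFold`; for regression, stratifying on quantile bins of the target is one common workaround, not the only option):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))

# Classification: stratify directly on the class labels.
y_class = rng.integers(0, 2, size=40)
for train, test in StratifiedKFold(n_splits=5).split(X, y_class):
    print(f"class-1 fraction  train: {y_class[train].mean():.2f}  test: {y_class[test].mean():.2f}")

# Regression: stratify on quantile bins of the continuous target.
y_reg = rng.normal(size=40)
bins = np.digitize(y_reg, np.quantile(y_reg, [0.25, 0.5, 0.75]))
reg_splits = StratifiedKFold(n_splits=5).split(X, bins)
```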
When does this particular pessimistic bias occur?
- This is a small-sample-size problem: in the described model, as soon as each weekday "bin" contains enough cases that leaving out even an extreme case shifts the training mean by an amount $\ll$ the spread of temperatures for that weekday, the effect on the observed error is negligible (see the simulation sketch after this list).
- A high-dimensional input/feature/training space offers more "possibilities" for a case to be extreme in some direction: in high-dimensional spaces, most points tend to lie at the "outside". This is related to the curse of dimensionality.
- It is also related to model complexity in the sense that high error for edge cases indicates that the model becomes unstable immediately outside the training region.
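To illustrate the first point, a small simulation sketch (assuming, as in the question, a model that predicts each weekday's temperature by that weekday's training mean): the shift of the training mean caused by leaving out the most extreme case shrinks roughly like $1/n$, so for large bins it becomes negligible compared with the spread of temperatures.

```python
import numpy as np

rng = np.random.default_rng(0)
spread = 3.0  # within-weekday standard deviation of temperatures

for n in (5, 20, 100, 500):
    temps = rng.normal(loc=15.0, scale=spread, size=n)   # one weekday "bin"
    i = np.argmax(np.abs(temps - temps.mean()))          # index of the most extreme case
    shift = abs(temps.mean() - np.delete(temps, i).mean())
    print(f"n={n:4d}  mean shift from leaving out the extreme case: {shift:.3f}  (spread = {spread})")
```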