
So I've read conflicting information and am confused. I read that more relevant training data will decrease the variance of a model.

However, I read about how in k-fold cross validation, if k is very large, and you train on k-1 of the folds, and test on 1 fold, then the variance will likely be really high, since the training size is huge.

Which one of these is true?

1 Answer

The choice of a larger $k$ in $k$-fold cross-validation (CV) does not give you more training data. If you do $k$-fold CV, ultimately every one of your data points is used for training $k-1$ times, so the total number of training data points is the same whatever $k$ you choose.

If you look at a single run in isolation, a higher $k$ does indeed give you more training data than a lower $k$, so the training on that one fold is more accurate. However, this effect is countered by what happens when you run all $k$ folds: the results are highly correlated, because there is a big overlap between the different training sets. Any two training sets share $k-2$ of their $k-1$ folds, so as $k$ grows, the training sets are all large but nearly identical. This correlation increases the variance of the overall $k$-fold procedure in which you run all $k$ folds.
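To make the overlap concrete, here is a small illustrative sketch (my own, not from the answer; it assumes $n=100$ points split into contiguous, unshuffled folds) that computes the training-set size and the fraction of points shared between two training sets for several values of $k$:

```python
# Illustrative sketch: training-set size and pairwise training-set overlap
# in k-fold CV. Assumes n = 100 points and contiguous folds (no shuffling).
n = 100
points = set(range(n))

for k in (2, 5, 20, 50):
    # Split the indices into k roughly equal folds.
    folds = [set(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    # The training set for each fold is everything except that fold.
    trains = [points - f for f in folds]
    size = len(trains[0])                        # n * (k - 1) / k points
    overlap = len(trains[0] & trains[1]) / size  # shared fraction, ~ (k-2)/(k-1)
    print(f"k={k:2d}: train size={size}, overlap={overlap:.2f}")
```

For $k=2$ the two training sets are disjoint (overlap 0), while for $k=50$ each training set has 98 of the 100 points and any two of them share 96 of those 98, an overlap of about 0.98, which is why the fold results are so strongly correlated.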

So the two statements you read look correct (obviously I can't know their precise context), but they are not conflicting.

(That said, the second statement's "really" high variance seems exaggerated; the overall variance even with very large $k$ may still be acceptable, though probably not optimal. But one cannot simply say that the variance gets smaller as $k$ gets larger, for the reason given above.)

Christian Hennig