Let's step back a bit and return to the primary use of cross validation: validation or verification, i.e. measuring the predictive performance of a model you trained on the data at hand, for the purpose of using that model for prediction. There is no decision yet on what that estimate will be used for (if you like: no hyperparameter tuning, the model is final as it is). Also, for the moment we'll say that the blue Test data part of All data does not exist / is not available.
So we train a model on the training data and need to estimate its predictive performance. Cross validation now runs the training procedure $k + 1$ times (see the sketch after this list):
- on the whole Training Data,
- and on each of the $k$ folds.
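Here is a minimal sketch of those $k + 1$ runs. The concrete setup (scikit-learn, a toy regression data set standing in for the Training Data, an ordinary linear regression as the training procedure, $k = 5$) is my illustrative assumption, not part of the question:

```python
# Minimal sketch of the k + 1 training runs (illustrative setup: scikit-learn,
# a toy regression data set standing in for the Training Data, and an ordinary
# linear regression as the training procedure).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
k = 5

# Run 0: the final model, trained on the whole Training Data.
final_model = LinearRegression().fit(X, y)

# Runs 1..k: one surrogate model per fold, each trained from scratch
# (previous runs are "forgotten") and tested on its held-out fold.
fold_errors = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=1).split(X):
    surrogate = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[test_idx], surrogate.predict(X[test_idx])))

# The pooled fold results serve as the estimate of the generalization error
# of final_model.
print("CV estimate of MSE:", np.mean(fold_errors))
```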
These runs are independent of each other in the sense that results of previous runs are completely "forgotten" for the next run.
The resulting models of course are not independent of each other, as they were trained on almost the same training data. And neither should they be:
In order to take the test results of the $k$ surrogate models and use them as an approximation of the generalization error of the final model,
we assume that the $k$ so-called surrogate models are equal or equivalent to the final model trained on the whole Training Data, since their training data differs from the whole Training Data only in a negligible way (each one leaves out $\frac{1}{k}$ of the data points).
Now, this assumption often doesn't completely hold: the surrogate models, being trained on less data, are on average a bit worse than the final model. Hence the well-known slight pessimistic bias of cross validation.
We can use a weaker assumption instead: the $k$ surrogate models are equal or equivalent to each other; in other words, the models are stable against exchanging a small fraction ($\frac{1}{k-1}$) of their training cases for other cases.
Whether that assumption is met can be checked either across the surrogate models (e.g. do their slopes and/or intercepts change noticeably or hardly at all?) or via repeated cross validation (see below); if it is not met, we say our models are unstable.
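As a sketch of such a stability check, again under the illustrative scikit-learn linear regression setup from above (the learner and data are my assumptions, not prescribed by the question), one can collect the fitted slopes and intercepts of the surrogate models and look at their spread across folds:

```python
# Sketch of a stability check across the surrogate models: if the surrogate
# models are equivalent to each other, their fitted slopes and intercepts
# should hardly change from fold to fold.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

coefs, intercepts = [], []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    surrogate = LinearRegression().fit(X[train_idx], y[train_idx])
    coefs.append(surrogate.coef_)
    intercepts.append(surrogate.intercept_)

# Small spread relative to the coefficient values suggests stable models;
# large spread means the surrogate models are not equivalent to each other.
print("slope spread per feature:", np.std(coefs, axis=0))
print("intercept spread:", np.std(intercepts))
```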
However, cross validation is not a good simulation of getting entirely new data sets due to the overlap (= correlation) between the training sets of the folds, so the relevant variance e.g. for algorithm comparison cannot be calculated from cross validation: for that we'd need the training sets to be independent of each other.
> When CV is meant to provide more data to train on,
I'd say CV is mostly meant to provide more test data: we pool test results for all cases in the data set.
But none of the CV surrogate models has more training data than a fixed train/test split that reserves $\frac{1}{k}$ of the data for testing. And the CV results are usually taken as an approximation of the performance of the model trained on the whole data set.
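A small sketch of that pooling, under the same illustrative scikit-learn setup: every case is predicted exactly once, by the surrogate model that did not see it during training, so the pooled error uses all $n$ cases as test cases:

```python
# Sketch of pooling the test results over all cases of the data set.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Each prediction comes from the surrogate model whose training fold
# did not contain that case.
pooled_predictions = cross_val_predict(
    LinearRegression(), X, y, cv=KFold(n_splits=5, shuffle=True, random_state=1)
)
print("pooled test MSE over all", len(y), "cases:",
      mean_squared_error(y, pooled_predictions))
```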
> why not just repeating the whole procedure arbitrary times? After performing all splits, you just repeat the entire process again.
As long as your training procedure is completely deterministic, re-evaluating a split that was already evaluated won't give you any new results. So repeating makes sense only if you generate new splits, and that is exactly what repeated (iterated) cross validation does. However, the new splits still test cases that have been tested before, just with surrogate models trained on slightly different training samples. You can use this to extract information about the stability of the results of the training procedure, but if the bottleneck is the actual number of tested cases, it doesn't help.
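Here is a sketch of what repeated cross validation gives you, under the same illustrative setup as above: each case is tested once per repetition, each time by a surrogate model trained on a slightly different sample, so the per-case spread of predictions across repetitions measures instability rather than adding new independent test cases:

```python
# Sketch of repeated (iterated) cross validation: new random splits, but the
# same cases are tested again, each time by a surrogate model trained on a
# slightly different training sample.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

k, n_repeats = 5, 20
predictions = np.empty((len(y), n_repeats))
rkf = RepeatedKFold(n_splits=k, n_repeats=n_repeats, random_state=2)
for i, (train_idx, test_idx) in enumerate(rkf.split(X)):
    repetition = i // k  # splits are generated repetition by repetition
    surrogate = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions[test_idx, repetition] = surrogate.predict(X[test_idx])

# Variation of the prediction for one and the same case across repetitions is
# caused only by the exchanged training cases, i.e. by model instability.
print("mean per-case spread of predictions:", predictions.std(axis=1).mean())
```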