I struggle to understand $k$-fold cross validation. I understand it is a tool to estimate the generalization error of a model: it works by shuffling the data and dividing it into $k$ chunks. Then $k$ models are trained, each time using a different chunk for testing and the remaining $k-1$ chunks for training. The mean and spread of the $k$ test errors give an estimate of the generalization error.
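To check that I have the procedure right, here is a minimal sketch of my understanding using scikit-learn's `KFold`; the data and the logistic-regression model are just placeholders:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder data; in practice X, y would be the real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

# Shuffle and split into k chunks, train k models,
# each time holding out a different chunk for testing.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    errors.append(1 - accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Mean and spread of the k test errors as the generalization estimate.
print(np.mean(errors), np.std(errors))
```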
My first question is:
1- Since $k$-fold CV is used only to check the generalization error, can the finally deployed model be trained on all of the available data?
My second question probably reveals a gap in my understanding:
2- How should one choose $k$, and how can results obtained with different $k$ be compared? With 10-fold or 5-fold cross validation, the samples used for training have different sizes $N_t$ ($N_t = \frac{9}{10}N$ and $N_t = \frac{4}{5}N$ respectively, where $N$ is the total amount of data available). From a learning curve (https://en.wikipedia.org/wiki/Learning_curve_(machine_learning)) we know that the error depends on the number of training samples. So how should we compare the results of cross validations with different $k$? It would be strange if the errors were not comparable, because then they would depend on $k$, which is not a nice property.
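To make the question concrete, here is what I mean by running the same procedure with different $k$ (again with placeholder data and model):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Same placeholder data as in the first sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

# Each k-fold run trains on a different fraction (k-1)/k of the data,
# so the mean errors sit at different points of the learning curve.
for k in (5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    acc = cross_val_score(LogisticRegression(), X, y, cv=cv)
    print(f"k={k}: mean error = {1 - acc.mean():.3f}, std = {acc.std():.3f}")
```

Should I expect the two mean errors printed here to agree, and if not, which one is "the" generalization error?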
Thanks for any insight.