Cross Validation (Wikipedia) is not a technique to avoid overfitting. It is a method for judging a model: you train it on some of the data and test it on the rest, in order to measure its performance. That is all it is supposed to do.
The reason why we use Cross Validation, instead of one fixed test set, is that when doing model selection or hyperparameter tuning, you want to select the model that does best with respect to your test set. If you do that with one fixed test set, you will have fitted your model to your testing data, which you should not do if you want the estimate of your error to be meaningful. This is the same reason why Machine Learning competitions limit the number of test submissions you can make per day (see the Baidu controversy).
This does not change whether you have a small or a large dataset, and it holds for any type of learner.
After having used cross validation to build $K$ models, you test them on the $K$ test sets, which gives you $K$ values of your cost function. You can take their mean as a measure of the bias of your model (how wrong it is), and their standard deviation as a measure of the variance of your model (how much it changes if the input data changes a little); see the Bias-variance tradeoff (Wikipedia). This should guide you in the design of your model, which hyperparameters you need to change, etc.
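To make this concrete, here is a minimal sketch of the procedure (assuming Python with scikit-learn, a synthetic regression dataset, a Ridge model and $K=5$, all of which are just illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []

for train_idx, test_idx in kf.split(X):
    # Train a fresh model on K-1 folds, test on the held-out fold.
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# Mean ~ how wrong the model is; std ~ how sensitive it is to the data it saw.
print("mean CV error:", np.mean(fold_errors))
print("std of CV error:", np.std(fold_errors))
```

A large mean says the model is systematically wrong, a large standard deviation says its performance depends a lot on which data it happened to be trained on.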
But when you are done tuning your model and no longer need an estimate of the error, you should train your model on your complete training data. You can also train $K$ sub-models and average them, but that is not Cross Validation, that is Bagging (Wikipedia).
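As an aside, scikit-learn's GridSearchCV follows exactly this pattern when refit=True (the default): cross validation is only used to pick the hyperparameters, and the winning configuration is then refit on the complete training data. A minimal sketch, with the same illustrative data as above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# 5-fold CV is used only to pick alpha; the best configuration is then
# refit on all of X, y (because refit=True).
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring="neg_mean_squared_error", refit=True)
search.fit(X, y)

final_model = search.best_estimator_  # trained on the complete training data
print("best alpha:", search.best_params_)
```

Note that search.best_score_ is the CV score of the winning hyperparameters, which is exactly the kind of slightly optimistic estimate discussed in the last bullet below.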
Bullet-point Q/A
Which of those K sets of learned params should I select?
None of them; they serve only to judge the performance of your model.
Do I retain the params set which gave the least error?
If you are talking about the hyperparameters, yes
Or should I now learn a new model with all training data together
Yes
(in which case, why did I do CV in the first place)?
To get an estimate of what your generalization error was.
Does CV help avoid overfitting?
It does nothing by itself. It can help you spot overfitting, and you can do something about it.
Is this true in all cases or is it true only for specific type of learners (like Decision trees)?
True for all
How does CV avoid overfitting if the above is true in general for any learner?
It does not avoid it by itself; it exposes it, by showing you both the training error and the test error.
Is CV needed only for small datasets or also for large datasets (think big data scale)?
It is valid for all scales
If the purpose of CV is to get a better estimate of the prediction error (and assuming we have a large test set), why shouldn't we just use a fixed training set, and a number of disjoint test subsets, averaging the errors across all test subsets?
Because you will tune the hyperparameters of your model to reduce that test error, which means you will have fitted your model to your test set. If you do this, the "test error" you get is no longer a test error, and it is too optimistic (see the nested cross validation sketch below).
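If you want both at once (tune the hyperparameters and still keep an honest estimate of the generalization error), the usual remedy is nested cross validation: an inner CV loop does the tuning, an outer CV loop measures the error of the whole tuning procedure. A sketch, again with scikit-learn and illustrative data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# Inner loop: hyperparameter tuning by cross validation.
inner = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=5, scoring="neg_mean_squared_error")

# Outer loop: each outer test fold is never seen by the tuning procedure,
# so the averaged score estimates the error of "tune, then predict" as a whole.
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="neg_mean_squared_error")
print("nested CV error estimate:", -outer_scores.mean())
```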