40

Why does a cross-validation procedure overcome the problem of overfitting a model?

Franck Dernoncourt
user3269
  • Look at the works of [Alain Celisse](http://math.univ-lille1.fr/~celisse/). As far as I have read (too little, alas), his work is about the merits of cross-validation. – mpiktas Apr 01 '11 at 17:10
  • @mpiktas Indeed, and one of his papers was already proposed for the CVJC, http://www.mendeley.com/groups/999241/crossvalidated-journal-club/papers/. – chl Apr 01 '11 at 19:57

6 Answers

29

I can't think of a sufficiently clear explanation just at the moment, so I'll leave that to someone else; however, cross-validation does not completely overcome the over-fitting problem in model selection, it just reduces it. The cross-validation error does not have a negligible variance, especially if the size of the dataset is small; in other words, you get a slightly different value depending on the particular sample of data you use. This means that if you have many degrees of freedom in model selection (e.g. lots of features from which to select a small subset, many hyper-parameters to tune, many models from which to choose), you can over-fit the cross-validation criterion, as the model is tuned in ways that exploit this random variation rather than in ways that really do improve performance, and you can end up with a model that performs poorly. For a discussion of this, see Cawley and Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", JMLR, vol. 11, pp. 2079-2107, 2010.
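As a rough illustration of that last point, here is a minimal sketch (assuming scikit-learn; the dataset is invented and the labels are pure noise, so no feature subset can genuinely help): picking the best of many candidate feature subsets by its cross-validation score still produces a flattering CV estimate, while a large fresh test set shows the winner is no better than chance.

```python
# Minimal sketch: over-fitting the cross-validation criterion itself.
# The labels are independent of the features, so no subset can truly help,
# yet the best of many candidate subsets (chosen by CV score) looks good
# on CV while doing no better than chance on fresh data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 60, 200                                   # small sample, many noise features
X, y = rng.normal(size=(n, p)), rng.integers(0, 2, size=n)
X_test, y_test = rng.normal(size=(2000, p)), rng.integers(0, 2, size=2000)

best_score, best_cols = -np.inf, None
for _ in range(200):                             # many degrees of freedom in selection
    cols = rng.choice(p, size=5, replace=False)  # one candidate feature subset
    score = cross_val_score(LogisticRegression(), X[:, cols], y, cv=5).mean()
    if score > best_score:
        best_score, best_cols = score, cols

final = LogisticRegression().fit(X[:, best_cols], y)
print("CV accuracy of selected subset:", round(best_score, 3))    # typically well above 0.5
print("accuracy on fresh test data   :", round(final.score(X_test[:, best_cols], y_test), 3))  # ~0.5
```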

Sadly cross-validation is most likely to let you down when you have a small dataset, which is exactly when you need cross-validation the most. Note that k-fold cross-validation is generally more reliable than leave-one-out cross-validation as it has a lower variance, but may be more expensive to compute for some models (which is why LOOCV is sometimes used for model selection, even though it has a high variance).
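For ordinary least squares, for example, the leave-one-out residuals can be read off a single fit on all the data via the hat-matrix identity $e_i^{(-i)} = e_i/(1-h_{ii})$ (the PRESS residuals); a minimal numpy sketch, for illustration only:

```python
# Minimal sketch: LOOCV for ordinary least squares as a by-product of one fit,
# using the identity  loo_residual_i = residual_i / (1 - h_ii).
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # one fit on the whole dataset
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # leverages (hat-matrix diagonal)
print("LOOCV MSE via hat matrix:", np.mean((resid / (1 - h)) ** 2))

# Brute force for comparison: n separate fits, each leaving one point out.
loo = [y[i] - X[i] @ np.linalg.lstsq(np.delete(X, i, 0), np.delete(y, i), rcond=None)[0]
       for i in range(n)]
print("LOOCV MSE by refitting  :", np.mean(np.square(loo)))
```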

Dikran Marsupial
  • One thought I've had is that cross validation is merely applying a different (implicit) model for the data. You can certainly show this with the "cousin" of CV, the non-parametric bootstrap (which is based on a Dirichlet Process model with concentration parameter of 0). – probabilityislogic Aug 05 '15 at 14:15
  • Interesting idea. My view is that (for the models I am interested in) the separation into parameters and hyper-parameters is computational rather than logical; the hyper-parameters are still parameters that need to be fitted to the data, and doing this indirectly using cross-validation doesn't really change that. In my last paper, I investigated tuning what are normally hyper-parameters of a kernel model using the training criterion, adding an additional regularisation term to avoid overfitting the model selection criterion (LOOCV), and it worked quite well. – Dikran Marsupial Aug 07 '15 at 06:57
  • Why is k-fold CV more expensive than leave-one-out? My experience (and my intuition) says otherwise. Since in k-fold CV we are doing k tests, whereas in LOO we are doing N (>> k) tests, and the training part usually takes longer due to some matrix inversion, isn't LOO the expensive option? – jeff Dec 10 '15 at 17:30
  • Leave-one-out can be performed (or approximated) as a by-product of fitting the model to the whole dataset, at very little additional cost, for a wide range of models (e.g. linear regression). I'll edit the answer to make this more clear. – Dikran Marsupial Dec 10 '15 at 17:33
  • My understanding of leave-one-out is that it is k-fold CV -- the best but most computationally expensive form of k-fold CV, where k = dataset size. – Daniel Winterstein Dec 24 '16 at 12:11
  • It generally isn't the best (unless the size of the dataset is very small), because it has a high variance and can have some pathological behaviours. Its main advantage is that for some models it is computationally cheap, but for most applications 5- or 10-fold cross-validation is probably a better option if computationally feasible. – Dikran Marsupial Dec 28 '16 at 10:57
  • The comment that increasing *k* in k-fold cross-validation increases the variance of the estimator is interesting. In several textbooks, I have read that n-fold cross-validation is best because it has the smallest bias. Do you have a reference that elaborates on this point? – cdalitz Jul 06 '21 at 06:10
  • @cdalitz k-fold cross-validation is biased in the sense that it gives an estimate of the performance of a model trained on only 100*(k-1)/k% of the data - so leave-one-out CV is almost unbiased in the sense that it makes (k-1)/k as large as possible. – Dikran Marsupial Jul 06 '21 at 13:27
  • The usual papers are Luntz and Brailovsky (but my technical Russian is a bit weak ;o) and Lachenbruch and Mickey, see my answer to this related question: https://stats.stackexchange.com/questions/154830/10-fold-cross-validation-vs-leave-one-out-cross-validation – Dikran Marsupial Jul 06 '21 at 13:37
  • @dikran-marsupial Sorry, my question was ambiguous: I was asking for a reference that demonstrates that the variance increases with $k$ in $k$-fold cross-validation. The answer to the related question argues otherwise, i.e., that *both* bias and variance are greater for 10-fold CV compared to n-fold CV (leave-one-out). – cdalitz Jul 06 '21 at 15:07
  • @cdalitz I'm not sure it is as simple as that, and it is likely to depend on the model being cross-validated and the size of the dataset. I think the high variance of LOO and of k-fold for small values of k are for different reasons, so you may end up with a "sweet-spot" of low variance for "sensible" values of k, such as 5 or 10. If you are using CV for model selection rather than performance evaluation, you can probably ignore the bias as it is likely to be commensurate for all models (which makes some Luntz and Brailovsky citations somewhat ironic ;o) – Dikran Marsupial Jul 07 '21 at 06:50
  • @dikran-marsupial Just realized that my question was already discussed in a number of threads. Here is a specific answer that summarizes different results: https://stats.stackexchange.com/a/357749/244807. The cited simulation studies indicate that in most cases leave-one-out has lower variance than 10-fold CV, and for linear regression this has even been known for some time (Burman 1989). There is, however, another argument for LOO not yet mentioned anywhere: the result of LOO is reproducible, i.e., two different researchers obtain the same results on the same data. – cdalitz Jul 07 '21 at 09:00
  • @cdalitz Thanks for the link, I will give that all a read! It could be that LOOCV has pathological cases where it can have extreme variance, but is good in the average case? Other forms of CV can be made reproducible (e.g. Raetsch had a set of datasets with pre-defined test-training splits). However, it has to be said that if the results are not robust to the partitioning of the CV splits, they probably aren't very interesting! Bit-wise reproducibility of results isn't particularly valuable IMHO; independent replication is more useful. – Dikran Marsupial Jul 07 '21 at 12:46
22

Not at all. However, cross validation helps you to assess by how much your method overfits.

For instance, if the training R-squared of a regression is 0.50 and the cross-validated R-squared is 0.48, you hardly have any overfitting and you feel good. On the other hand, if the cross-validated R-squared is only 0.3 here, then a considerable part of your model's performance comes from overfitting and not from true relationships. In such a case you can either accept the lower performance or try different modelling strategies with less overfitting.
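A minimal sketch of this kind of check, assuming scikit-learn and an invented dataset in which only two of thirty predictors carry any signal:

```python
# Minimal sketch: compare in-sample R^2 with cross-validated R^2 to gauge
# how much of the apparent fit is due to over-fitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 200, 30                                   # 30 predictors, only 2 carry signal
X = rng.normal(size=(n, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.5, size=n)

model = LinearRegression()
train_r2 = model.fit(X, y).score(X, y)           # optimistic in-sample R^2
cv_r2 = cross_val_score(model, X, y, cv=10, scoring="r2").mean()

print(f"training R^2  : {train_r2:.2f}")
print(f"10-fold CV R^2: {cv_r2:.2f}")
# A small gap suggests little overfitting; a large gap suggests much of the
# apparent fit reflects noise rather than true relationships.
```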

Michael M
  • I think this answer is correct in spirit, but I disagree with the characterization of overfitting in the second paragraph. I don't believe that overfitting occurs when train error - test error > some bound; instead, I would characterize overfitting as a situation where increasing the complexity of the model slightly tends to *increase* the hold-out error. Requiring that your train and test errors are comparable will often result in very *underfit* models. – Matthew Drury Feb 02 '16 at 16:52
16

My answer is more intuitive than rigorous, but maybe it will help...

As I understand it, overfitting is the result of model selection based on training and testing using the same data, where you have a flexible fitting mechanism: you fit your sample of data so closely that you're fitting the noise, outliers, and all the other variance.

Splitting the data into a training and testing set keeps you from doing this. But a static split is not using your data efficiently and your split itself could be an issue. Cross-validation keeps the don't-reward-an-exact-fit-to-training-data advantage of the training-testing split, while also using the data that you have as efficiently as possible (i.e. all of your data is used as training and testing data, just not in the same run).

If you have a flexible fitting mechanism, you need to constrain your model selection somehow so that it doesn't favor "perfect" but complex fits. You can do it with AIC, BIC, or some other penalization method that penalizes fit complexity directly, or you can do it with CV. (Or you can do it by using a fitting method that is not very flexible, which is one reason linear models are nice.)
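For instance, here is a minimal sketch (illustrative setup, assuming scikit-learn) of letting CV, rather than AIC/BIC, rein in a flexible polynomial fit: the training fit keeps improving with degree, but the cross-validated score does not.

```python
# Minimal sketch: training R^2 always rises with polynomial degree, but the
# cross-validated R^2 does not, so CV refuses to reward the "perfect" but
# overly complex fits.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=40)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=40)

for degree in (1, 3, 10, 20):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_r2 = model.fit(x, y).score(x, y)             # keeps rising with degree
    cv_r2 = cross_val_score(model, x, y, cv=5).mean()  # peaks at a modest degree
    print(f"degree {degree:2d}: train R^2 = {train_r2:.2f}, 5-fold CV R^2 = {cv_r2:.2f}")
```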

Another way of looking at it is that learning is about generalizing, and a fit that's too tight is in some sense not generalizing. By varying what you learn on and what you're tested on, you generalize better than if you only learned the answers to a specific set of questions.

Wayne
8

Cross-Validation is a good, but not perfect, technique to minimize over-fitting.

Cross-validation will not help you perform well on outside data if the data you do have is not representative of the data you'll be trying to predict!

Here are two concrete situations in which cross-validation has flaws:

  • You are using the past to predict the future: it is often a strong assumption that past observations come from the same population, with the same distribution, as future observations. Cross-validating on a data set drawn from the past won't protect against this (a split that at least respects the time ordering is sketched after this list).
  • There is a bias in the data you collect: the data you observe is systematically different from the data you don't observe. For example, we know about respondent bias in those who chose to take a survey.
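On the first point, splits that respect the time ordering are at least more honest than shuffled k-fold, though they cannot remove the assumption that the future resembles the past. A minimal sketch using scikit-learn's `TimeSeriesSplit` on made-up data:

```python
# Minimal sketch: forward-chaining ("rolling origin") evaluation, where each
# validation block is predicted only from observations that precede it.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
n = 300
t = np.arange(n)
X = np.column_stack([t, np.sin(t / 20.0)])
y = 0.01 * t + np.sin(t / 20.0) + rng.normal(scale=0.2, size=n)

fold_mse = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])    # fit on the past only
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
print("forward-chained MSE per fold:", np.round(fold_mse, 3))
```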
TrynnaDoStat
  • Having a dataset that is a poor representation of the true population is generally considered a separate issue from overfitting. Of course, it is correct that cross-validation does not address it. – Cliff AB Feb 03 '16 at 01:49
  • "Here are two concrete situations when cross-validation has flaws" -> "Here are two concrete situations where the DATA has flaws" – MrR Aug 03 '21 at 12:55
3

From a Bayesian perspective, I'm not so sure that cross validation does anything that a "proper" Bayesian analysis doesn't do for comparing models. But I am not 100% certain that it does.

This is because if you are comparing models in a Bayesian way, then you are essentially already doing cross validation. This is because the posterior odds of model A $M_A$ against model B $M_B$, with data $D$ and prior information $I$ has the following form:

$$\frac{P(M_A|D,I)}{P(M_B|D,I)}=\frac{P(M_A|I)}{P(M_B|I)}\times\frac{P(D|M_A,I)}{P(D|M_B,I)}$$

And $P(D|M_A,I)$ is given by:

$$P(D|M_A,I)=\int P(D,\theta_A|M_A,I)d\theta_A=\int P(\theta_A|M_A,I)P(D|M_A,\theta_A,I)d\theta_A$$

This is called the prior predictive distribution. It basically says how well the model predicted the data that was actually observed, which is exactly what cross validation does, with the "prior" being replaced by the "training" model fitted, and the "data" being replaced by the "testing" data. So if model B predicted the data better than model A, its posterior probability increases relative to model A. It seems from this that Bayes' theorem will actually do cross validation using all the data, rather than a subset. However, I am not fully convinced of this - it seems like we get something for nothing.
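One standard way to make that connection concrete is to factor the prior predictive distribution by the chain rule over the individual observations $d_1,\dots,d_n$ making up $D$:

$$P(D|M_A,I)=\prod_{i=1}^{n}P(d_i|d_1,\dots,d_{i-1},M_A,I)$$

so each observation is scored by a predictive distribution that has only "seen" the observations before it, and the product comes out the same whatever order the data are taken in. This is the sense in which the marginal likelihood performs a whole sequence of train/test evaluations using all of the data at once.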

Another neat feature of this method is that it has a built-in "Occam's razor", given by the ratio of normalisation constants of the prior distributions for each model.

However, cross validation seems valuable for the dreaded old "something else", or what is sometimes called "model misspecification". I am constantly torn over whether this "something else" matters or not, for it seems like it should matter - but it leaves you paralyzed with no solution at all when it apparently matters. Just something to give you a headache, but nothing you can do about it - except for thinking of what that "something else" might be, and trying it out in your model (so that it is no longer part of "something else").

And further, cross validation is a way to actually do a Bayesian analysis when the integrals above are ridiculously hard. And cross validation "makes sense" to just about anyone - it is "mechanical" rather than "mathematical". So it is easy to understand what is going on. And it also seems to get your head to focus on the important part of models - making good predictions.

probabilityislogic
  • The model misspecification issue is the key. Bayesian methods (especially the "poor man's" Bayes of evidence maximisation) can perform very poorly under model misspecification, whereas cross-validation seems to work pretty well almost all the time. The gain when the assumptions (priors) are "right" is generally much smaller than the penalty when they are "wrong", so cross-validation wins on average (as it makes almost no assumptions). It isn't nearly as intellectually satisfying though! ;o) – Dikran Marsupial Apr 02 '11 at 17:54
  • @dikran - interesting. I'm not so sure I agree with what you say, though. So you are saying that if the model is misspecified, then cross validation with that same model is better than using Bayes' theorem? I would like to see an example of this. – probabilityislogic Apr 03 '11 at 00:16
  • @probabiltyislogic I don't think it is a particularly new observation, Rasmussen and Williams mention it on page 118 of their excellent Gaussian Process book (although it is essentially a reference to a similar comment in Grace Wahba's monograph on splines). Essentially the marginal likelihood is the probability of the data given the assumptions of the model, whereas the XVAL likelihood is an estimate of the probability of the data, regardless of the model assumptions, hence more reliable when the assumptions are not valid. A proper empirical study would be useful. – Dikran Marsupial Apr 03 '11 at 14:06
  • @probabilityislogic I should add that I like the Bayesian approach to model selection, but I almost always used cross-validation in practice simply because it generally gives results that are (statistically) as good as, or better than Bayesian approaches. – Dikran Marsupial Apr 03 '11 at 14:08
  • Cross validation selects models based solely on predictive performance; marginal likelihoods don't - they "account" for every dimension. In very high dimensional settings this matters - sometimes a lot. Say you've got a big predictor vector $X_i$ and a 1 dimensional response $y_i$. You need a model for $X_i$ to do dimension reduction in a fully Bayesian way. So you specify a joint model as $p(y_i|X_i, \theta_y)p(X_i|\theta_X)$. The second term has a much bigger contribution to the likelihood, so if a model does well there and bites it on the prediction the marginal likelihood won't care. – JMS Apr 04 '11 at 00:34
  • ... Here's a paper describing just this situation, for a particular model: http://ftp.isds.duke.edu/WorkingPapers/10-14.html. But the problem exists in big joint models regardless - the question is whether there is disagreement between the two "pieces" of the likelihood. – JMS Apr 04 '11 at 00:37
  • @JMS - this paper does make a seemingly good point with the "10 dimensional Gaussian" - but I think if you were only interested in prediction of $X_{10}$ then the marginal distribution of $X_{10}$ should be used, not the joint distribution of all X's. And it appears as though the factor loadings and idiosyncratic variances were just plugged in, rather than being estimated from the data - for we never know the "true model" or its closest approximation. Interesting read though - I may have to test out some simulations of my own. – probabilityislogic Apr 06 '11 at 12:07
  • @JMS For one, a "model" that isn't considered for prediction is to keep the second component and not the first. This is sensible because the goal is prediction of $X_{10}$, not summarising the variation in $X_i$; it's choosing the linear components which best predict $X_{10}$. So the example is incomplete in that only one 1-D model is considered. By using PCA regression one is implicitly asserting the *prior information* that higher-variance components of $X_i$ predict $X_{10}$ better. – probabilityislogic Apr 06 '11 at 12:14
  • @JMS - perhaps this is one reason why PCA regression is not always computationally advantageous unless you actually know this - for then you just have a regular model selection - which 1-D component do I use? – probabilityislogic Apr 06 '11 at 12:20
  • ...continuing my ramblings... I would speculate if the "third model" was included which specified fitting a 2-factor model, but only taking the second factor for prediction would be favored much more by the log-likelihood. – probabilityislogic Apr 06 '11 at 12:26
  • @probabilityislogic I don't necessarily disagree about using the marginal likelihood of $X_{10}$. But it doesn't get around the problem of a good joint model - the marginal distribution of $X_{10}$ is obtained by integrating out $X_1,\dots,X_9$! CV would consider instead $X_{10}|X_1,\dots,X_9,\Theta$. Your objection about the example taking the PC with the largest eigenvalue is correct; PCR is often done this way, and the example is to demonstrate that this is not necessarily a great idea. They augment the West 2003 factor regression to reflect this. – JMS Apr 06 '11 at 22:11
  • @probabilityislogic In terms of your last comment, I'm skeptical in general (although you might be right about this example) Suppose that the data is given as $D = (D_1, D_2)$ with $D_1$ 1 dimensional and $D_2$ much higher dimensional. Then write the BF as $P(D_1|D_2, M_1)P(D_2|M_1) / [P(D_1|D_2, M_2)P(D_2|M_2)]$. I claim in this case that without some seriously informative priors, or maybe fantastic amounts of data, $P(D_1|D_2, M_1)P(D_2|M_1) / [P(D_1|D_2, M_2)P(D_2|M_2)] \approx P(D_2|M_1) / P(D_2|M_2)$ – JMS Apr 06 '11 at 22:20
  • In 10 dimensions, maybe not such a big deal. In 100, or 10,000 it *could* be as the likelihood is dominated by the "irrelevant" pieces. Think of partial least squares vs PCA. I'm very much a Bayesian but I haven't seen a satisfactory answer to the problem yet. – JMS Apr 06 '11 at 22:27
  • Also, in my haste I think I misstated what $D_1$ and $D_2$ are. I'm thinking of something like $D_1 = \{y_{i1}\}_{i=1}^n$ and $D_2 = \{(y_{i2}, \dots, y_{ip})\}_{i=1}^n$. So $D_1$ is a collection of 1d observations, not 1d itself. – JMS Apr 06 '11 at 22:33
  • @JMS - once again, this is using the joint model to assess the marginal model as your equation is $\frac{P(D_1,D_2|M_1)}{P(D_1,D_2|M_2)}$. Bayes theorem is doing exactly what you ask of it, which is why the 1-D model is "swamped". You should be asking for $\frac{P(D_1|D_2,M_1)}{P(D_1|D_2,M_2)}$ if predicting $D_1$ with $D_2$ is the target of inference. – probabilityislogic Apr 06 '11 at 23:03
  • And those "irrelevant" pieces are only irrelevant to summarising variation in $D_2$, they may be quite relevant for predicting $D_1$. – probabilityislogic Apr 06 '11 at 23:07
  • @JMS - point taken about using the conditional distribution $X_{10}|X_{k},\Theta$, but this is not a "CV" thing, this is a model selection thing - there is nothing in CV which forces you to make this choice as "the model". – probabilityislogic Apr 06 '11 at 23:09
  • (cont'd) and further, this procedure assumes that you know $X_{k}$ and that you know $\Theta$ when it comes to actually predicting - hardly likely in practice (especially for $\Theta$). These should be integrated out of the model if they are unknown, so that you are conditioning only on the information you actually have. – probabilityislogic Apr 06 '11 at 23:14
  • @probabilityislogic I think you mistook what I meant by irrelevant bits (or maybe I misstated), although I think we agree: What's relevant is the marginal fit to $D_1$. If a model fits well to that one dimension then we should tolerate some misfit in the others, since we're interested in predicting $D^*_1|D^*_2$ given new data. In practice (ie as in Bayesian factor regression but many other areas too) people fit big joint models to do prediction on one margin. Generally we will infer models that (asymptotically) minimize the KLD from the true model... – JMS Apr 07 '11 at 01:52
  • ...but there is no reason in general that this model should also minimize any other risk function we care about, e.g., predictive MSE or similar. Really the model/estimator we want to pick is something like $\hat\theta = \arg\min_{t\in\Theta} \int\int R(D^*, t)p(D^*|\theta)p(\theta | D) d\theta dD$, the minimizer of our expected predictive risk (expectation wrt the posterior predictive). In principle we could of course obtain this through a complete Bayesian analysis, but practically it'll range from difficult to impossible. I think Bernardo & Smith discuss this problem further... – JMS Apr 07 '11 at 02:06
  • ...So where does CV come in? Leave k out CV approximates that integral with empirical averages, and rather effectively with large data. In that sense it is eminently convenient as it can be applied to any risk function with a decent empirical estimator with no trouble. The same can't be said of Bayesian procedures. – JMS Apr 07 '11 at 02:09
  • I forgot to mention that if this is at all interesting, you may want to have a look at some of the work using Gibbs posteriors. A nice paper (with references to the more foundational work) is "Gibbs posterior for variable selection in high-dimensional classification and data mining" by Jiang & Tanner, available on the arXiv: http://arxiv.org/abs/0810.5655. I haven't decided quite how to think about them, but it's a really interesting idea to me. – JMS Apr 07 '11 at 02:14
  • Seems like this paper takes the stance that in decision theory, the product of the loss function and posterior matters, not each piece separately. "Why draw the arbitrary distinction?" seems to be the idea here. Sounds interesting. – probabilityislogic Apr 07 '11 at 16:13
  • @probabilityislogic Intuitively, the idea that fixing a likelihood function should be about specifying how the data inform rather than just trying to match the sampling distribution appeals to me. It's also somewhat of a guard against model misspecification. Too many Bayesians wring their hands over priors while their likelihood is running the show! The question becomes how to make that operative, when can we get valid inferences, etc. Gibbs posteriors are the only work I'm familiar with which make a serious attempt to do this in applications. – JMS Apr 08 '11 at 01:02
2

I can also recommend these videos from the Stanford course on statistical learning. They go into quite some depth regarding how to use cross-validation effectively.

Cross-Validation and the Bootstrap (14:01)

K-fold Cross-Validation (13:33)

Cross-Validation: The Right and Wrong Ways (10:07)