Suppose we have a regression model that we want to fit to training data and then use to make predictions on test data. There is a well-known danger that the out-of-sample predictions will be poor, due to "over-fitting" of the model to the training data.$^\dagger$ If I understand it correctly, the phenomenon of over-fitting occurs because (1) the fitting method optimises in-sample prediction error over the training data, and thus conforms excessively to that data; or (2) the model selection method does not adequately penalise complexity. Some analysts use cross-validation with a "validation set" to deal with over-fitting; this still involves an optimisation, but the optimisation is now done using out-of-sample predictions (i.e., the regression model is fit to the training data, but the prediction errors are obtained from the validation data).
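To make the distinction concrete, here is a minimal sketch of the workflow I mean (in Python; the simulated data and the deliberately flexible degree-10 polynomial are my own illustrative assumptions, not part of the question): the model is fit to the training data alone, and the validation set supplies the out-of-sample prediction errors.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 60)

# Randomly split the data into a training set and a validation set.
idx = rng.permutation(60)
train, valid = idx[:40], idx[40:]

# Fit a deliberately flexible model (degree-10 polynomial) on the training data only.
coef = np.polyfit(x[train], y[train], deg=10)

# In-sample error (the quantity the fit itself optimises) versus
# out-of-sample error (the quantity the validation set estimates).
mse_train = np.mean((np.polyval(coef, x[train]) - y[train]) ** 2)
mse_valid = np.mean((np.polyval(coef, x[valid]) - y[valid]) ** 2)
print(f"in-sample MSE: {mse_train:.3f}   validation MSE: {mse_valid:.3f}")
```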
It occurs to me that there are already well-known methods for fitting a regression model on a "leave-one-out" basis (e.g., for linear regression we can minimise the LOOCV statistic instead of the residual sum of squares). So the obvious question is: why bother with a "validation" step at all? Why not just use this leave-one-out fitting method on the training data to begin with, so that we optimise out-of-sample prediction directly? In combination with an appropriate way to penalise model complexity (e.g., partial F-tests for model terms), that would seem to deal with over-fitting, while still using almost all of the available data points for each fit (since only one point is left out at a time, rather than a whole "validation set").
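For concreteness, here is a small sketch of the closed-form computation I have in mind for linear regression (the simulated data and the quadratic design matrix are illustrative assumptions): the leave-one-out residuals can be obtained from a single OLS fit as $e_{(i)} = e_i/(1-h_{ii})$, where the $h_{ii}$ are the leverages, so the LOOCV (PRESS) statistic does not require refitting the model $n$ times.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x, x**2])      # design matrix (illustrative)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # ordinary least squares fit
resid = y - X @ beta                            # ordinary residuals e_i

Q, _ = np.linalg.qr(X)                          # hat matrix H = Q Q'
h = np.sum(Q**2, axis=1)                        # leverages h_ii = diag(H)

rss = np.sum(resid**2)                          # in-sample criterion
press = np.sum((resid / (1 - h))**2)            # leave-one-out (LOOCV) criterion
print(f"RSS = {rss:.3f}   PRESS = {press:.3f}")
```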
My questions: Am I right that over-fitting is caused by the fact that we optimise based on errors from in-sample predictions? If so, does optimising based on errors from out-of-sample predictions avoid over-fitting? For example, if we fit a linear regression by minimising LOOCV (together with model selection that tests the inclusion of model terms), is that enough to avoid over-fitting?
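To illustrate what I am asking, here is a rough sketch (the polynomial candidate models and simulated data are again my own illustrative assumptions, and it uses LOOCV alone rather than the partial F-tests mentioned above): the in-sample RSS necessarily decreases with model complexity, whereas the LOOCV/PRESS criterion typically turns back upward once the model starts to over-fit, which is the behaviour I am hoping makes it a safe criterion to optimise.

```python
import numpy as np

def press_and_rss(X, y):
    """Return (PRESS, RSS) for an OLS fit of y on the design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)                        # leverages
    return np.sum((resid / (1 - h)) ** 2), np.sum(resid ** 2)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)

# Candidate models: polynomials of increasing degree.
for degree in range(1, 9):
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, ..., x^degree
    press, rss = press_and_rss(X, y)
    print(f"degree {degree}: RSS = {rss:6.3f}   PRESS = {press:6.3f}")
```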
$^\dagger$ There is an excellent discussion of over-fitting in Chapter 7 of Hastie, Tibshirani and Friedman (2016), examined within the context of a linear regression model with prediction error measured by squared-error loss. For this loss function they link the phenomenon of over-fitting to the bias-variance trade-off. There are also some answers on this site (e.g., here) that discuss over-fitting in terms of the bias-variance trade-off.