It's clear that if your model is doing a couple percent better on your
training set than your test set, you are overfitting.
That is not true. Your model has learned from the training data and has never "seen" the test set before, so naturally it should perform somewhat better on the training set. The fact that it performs (a little bit) worse on the test set does not by itself mean that the model is overfitting -- only a "noticeable" difference can suggest it.
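As a quick sanity check, here is a minimal sketch (synthetic data and scikit-learn, both assumptions made purely for illustration): even a model that matches the data-generating process exactly will usually score a bit differently on the data it was fitted on than on held-out data, so a small gap alone proves nothing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=500)  # truly linear relationship + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))
# The two scores typically differ a little even though the model is correctly
# specified and cannot meaningfully overfit -- a small train/test gap is not
# evidence of overfitting by itself.
```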
Check the definition and description from Wikipedia:
Overfitting occurs when a statistical model describes random error or
noise instead of the underlying relationship. Overfitting generally
occurs when a model is excessively complex, such as having too many
parameters relative to the number of observations. A model that has
been overfit will generally have poor predictive performance, as it
can exaggerate minor fluctuations in the data.
The possibility of overfitting exists because the criterion used for
training the model is not the same as the criterion used to judge the
efficacy of a model. In particular, a model is typically trained by
maximizing its performance on some set of training data. However, its
efficacy is determined not by its performance on the training data but
by its ability to perform well on unseen data. Overfitting occurs when
a model begins to "memorize" training data rather than "learning" to
generalize from a trend.
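To make the "too many parameters relative to the number of observations" part concrete, here is a small sketch (hypothetical noisy-line data and plain NumPy polynomial fitting, chosen only for illustration): as the polynomial degree approaches the number of training points, the training error collapses while the error on unseen data blows up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: the underlying relationship is simply y = x plus noise.
x_train = np.linspace(0, 1, 12)
y_train = x_train + rng.normal(scale=0.1, size=x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = x_test + rng.normal(scale=0.1, size=x_test.size)

for degree in (1, 3, 11):
    coeffs = np.polyfit(x_train, y_train, degree)   # degree + 1 parameters
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

# The degree-11 polynomial has as many parameters as training points, so it
# reproduces the noisy training points (almost) exactly -- it describes the
# noise rather than the relationship -- and its error on unseen data is far larger.
```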
In the extreme case, an overfitted model fits the training data perfectly and the test data poorly. However, in most real-life examples this is much more subtle and overfitting can be much harder to judge. Finally, it can happen that the data in your training and test sets are similar, so the model seems to perform fine on both sets, but then performs poorly when used on some new dataset because of overfitting, as in the Google Flu Trends example.
Imagine you have data on some variable $Y$ and its time trend (plotted below). You observe it for times 0 to 30 and decide to use the 0-20 part of the data as a training set and 21-30 as a hold-out sample. The model performs very well on both samples and there is an apparent linear trend, yet when you make predictions on new, previously unseen data for times beyond 30, the good fit turns out to be illusory.
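Here is a minimal sketch of that scenario (the sine-shaped "true" process, the noise level, and scikit-learn's LinearRegression are all assumptions made for this example, not the data behind the original plot): a straight line fitted on times 0-20 also looks fine on the 21-30 hold-out sample, yet fails badly once the process bends after time 30.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

def true_process(t):
    # Hypothetical "truth": almost linear for t in [0, 30], bends afterwards.
    return 50 * np.sin(np.pi * t / 120)

t = np.arange(0, 31)
y = true_process(t) + rng.normal(scale=1.5, size=t.size)

train = t <= 20            # times 0-20: training set
hold = t > 20              # times 21-30: hold-out sample

model = LinearRegression().fit(t[train].reshape(-1, 1), y[train])

def mse(times, values):
    return np.mean((model.predict(times.reshape(-1, 1)) - values) ** 2)

print("train MSE:   ", mse(t[train], y[train]))   # small
print("hold-out MSE:", mse(t[hold], y[hold]))     # still small -- looks fine

# "New, unseen" data for times beyond 30, drawn from the same process:
t_new = np.arange(31, 121)
y_new = true_process(t_new) + rng.normal(scale=1.5, size=t_new.size)
print("new-data MSE:", mse(t_new, y_new))         # orders of magnitude larger
```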

This is an abstract example, but imagine a real-life one: you have a model that predicts sales of some product, it performs very well in summer, but then autumn comes and the performance drops. Your model may be overfitting to the summer data -- maybe it is good only for summer data, maybe it performed well only on this year's summer data, or maybe this autumn is an outlier and the model is fine...