33

Assume that a model has 100% accuracy on the training data, but 70% accuracy on the test data. Is the following argument true about this model?

It is obvious that this is an overfitted model. The test accuracy can be enhanced by reducing the overfitting. But this model can still be useful, since its accuracy on the test data is acceptable.

Hossein
  • 3,170
  • 1
  • 16
  • 32
  • 18
    If 70% is acceptable in the particular applications, then I agree with you. – Richard Hardy May 11 '17 at 06:37
  • 6
    I'd fully agree with @RichardHardy. Take, for instance, a random forest: often, by construction, the in-sample performance (not the out-of-bag performance) is close to 100%, so grossly overfitting. But still, the lower performance evaluated out-of-bag or on test/validation sets might be high enough to make it a useful model. – Michael M May 11 '17 at 07:28
  • Just keep in mind that 70% in this case is not a good approximation of the out of sample error. – Metariat May 11 '17 at 12:06
  • 1
    @Metariat Why not? This accuracy was obtained on the test set, which is not used in the training phase. – Hossein May 11 '17 at 12:07
  • @Metariat, I think you might have confused the *validation* set and the *test* set, and their respective errors. The *test* error **is** the *out-of-sample* error. While *validation* error is an optimistic measure of a selected model, *test* error is not. The *test* error is an unbiased estimate of how the model will perform on a new sample from the same population. We can estimate the variance of the test error, so we are quite fine by knowing only the *test* error as long as the test set is not too small. – Richard Hardy May 11 '17 at 12:53
  • 1
    @Metariat, once it became clear what you mean by contrasting the test error vs. the out-of-sample error, I still do not follow your logic for why 70% in this case is not a good approximation of the out-of-sample error. We do not know the variance of the test error as the OP has not indicated that. But the variance can be estimated, and it need not be high (especially if the test set is large). – Richard Hardy May 11 '17 at 13:22
  • @RichardHardy Think of it this way: the model is overfitted and the out-of-sample accuracy is 60%, but the testing accuracy depends a lot on the data -> when you have a "lucky" testing set, the accuracy is 70%; based on this information, we conclude that the model is good enough. You can separate that testing set into several sets in order to estimate the variance, but the mean is still 70%, and this is what's misleading. The notion of unbiasedness applies only when you have a lot of testing sets randomly picked from the population; here we have only one, so that's why it is not a good approx. – Metariat May 11 '17 at 13:40
  • 3
    @Metariat, Such an argument can take down most of statistical practice, e.g. the OLS estimator in the Normal linear model, $t$-test for equality of a sample mean to a hypothesized value, and what not. What is interesting is that the argument does not depend on the model overfitting on the training sample. It holds as well for underfit models and any model in general. Is that not correct? – Richard Hardy May 11 '17 at 13:45
  • Yes, the OLS estimator and the t-test of the overfitted model could also be misleading. It holds as well for other models, but the accuracy of the latter depends on the data less than that of overfitted models. – Metariat May 11 '17 at 13:50
  • 4
    I wouldn't call that model overfitted. An overfitted model is established by comparing validation performance to test performance. Even then, only if the test performance is considerably lower than acceptable, possibly catastrophically. Training performance means nothing. – Firebug May 11 '17 at 13:55
  • @RichardHardy I'm not sure getting what you mean by "it"? – Metariat May 11 '17 at 13:56
  • @Metariat, that is an interesting conjecture (that "it" can have a greater effect on overfitted models than on non-overfitted models). How would you explain the logic behind "it" (either formally or intuitively)? ("it" is what you refer to when you say "it holds".) – Richard Hardy May 11 '17 at 13:57
  • 1
    I agree with firebug, it does not follow at all from the stated numbers that the model is overfit. Perfectly fit models can have very different training and test performances, and often do. – Matthew Drury May 11 '17 at 14:12
  • As per @hxd1011's answer, it depends on your evaluation function, and which region of the ROC you most care about. In cases where we want to overweight TPR and underweight FPR, we use the $F_\beta$ score with $\beta \gg 1$. – smci May 11 '17 at 23:19
  • As you look at the other answers, make sure you check @RichardHardy's initial qualifier **If 70% is acceptable**. If 80% of your data is class A and 20% class B, 70% is actually worse than a naive model that always predicts class A. When you know there is potential over-fitting, you need to be extra careful in your evaluation of the testing data. – Barker May 12 '17 at 23:46
  • Very simple models have this property of "overfitting" on the training set. Take the nearest-neighbour estimator. As to whether overfitted estimators (in a validation sense) are useful in general, I believe that if they are, you are using the wrong measure of performance (for instance, precision/recall rates are closer to the justification of the model than to its "overfitting"). – Maxim May 13 '17 at 09:25
  • To me, overfitting is about how the testing error responds when I vary the complexity of the model. It's not possible to measure this by comparing one fitted model on two datasets. You must compare multiple models across a range of complexities. – Matthew Drury Aug 25 '17 at 14:58
  • @Firebug can you elaborate why training performance "means nothing" and drop-off in performance between training set and test set is not an indication of overfitting? – dwhdai Jun 28 '19 at 14:31

5 Answers

35

I think the argument is correct. If 70% is acceptable in the particular application, then the model is useful even though it is overfitted (more generally, regardless of whether it is overfitted or not).

While balancing overfitting against underfitting concerns optimality (looking for an optimal solution), having satisfactory performance is about sufficiency (is the model performing well enough for the task?). A model can be sufficiently good without being optimal.

Edit: after the comments by Firebug and Matthew Drury under the OP, I will add that judging whether the model is overfitted without knowing the validation performance can be problematic. Firebug suggests comparing the validation vs. the test performance to measure the amount of overfitting. Nevertheless, when the model delivers 100% accuracy on the training set without delivering 100% accuracy on the test set, it is an indicator of possible overfitting (especially so in the case of regression, but not necessarily in classification).
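
Here is a minimal sketch of the point, assuming a synthetic dataset and an unpruned decision tree purely for illustration: the tree memorizes the training set and reaches 100% training accuracy, yet its lower test accuracy may still be good enough, and the binomial standard error shows how precisely that single test figure is estimated.

```python
# Illustrative sketch (assumed data and model): 100% training accuracy,
# lower but quantifiable test accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = DecisionTreeClassifier(random_state=0)  # unpruned: memorizes the training set
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # typically 1.00
test_acc = model.score(X_test, y_test)     # noticeably lower

# Normal-approximation 95% half-width: the accuracy reported on a finite
# test set has quantifiable sampling uncertainty.
half_width = 1.96 * np.sqrt(test_acc * (1 - test_acc) / len(y_test))
print(f"train acc = {train_acc:.3f}, test acc = {test_acc:.3f} +/- {half_width:.3f}")
```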

Richard Hardy
  • 54,375
  • 10
  • 95
  • 219
  • You asked for an example: take the code for a neural net on the *iris* dataset at https://stats.stackexchange.com/a/273930/2958 and then try `set.seed(100)` for an illustration of the phenomenon described here and `set.seed(15)` for the opposite. Perhaps better to say "an indicator of *possible* overfitting". – Henry May 11 '17 at 17:04
  • Is it ever possible for a model to attain 100% accuracy on both train and test and not be overfitted? – Hossein Oct 06 '19 at 05:57
  • 1
    @Breeze, I think you could ask this on a separate thread (and link to this one for context if needed). – Richard Hardy Oct 06 '19 at 07:53
  • I just did; here is the [link](https://stats.stackexchange.com/questions/430192/is-it-ever-possible-for-a-model-to-attain-100-accuracy-on-both-train-and-test-a). – Hossein Oct 06 '19 at 08:20
33

In a past project on credit card fraud detection, we intentionally wanted to overfit the data / hard-code the model to remember fraud cases. (Note that overfitting one class is not exactly the general overfitting problem the OP talked about.) Such a system has relatively low false positives and satisfied our needs.

So, I would say an overfitted model can be useful in some cases.
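
A minimal sketch of the idea, assuming a toy wrapper in which known fraud records are stored verbatim and always flagged, with an ordinary classifier as fallback (the class and the fallback model here are hypothetical, not the project's actual code):

```python
# Hypothetical sketch: memorize training fraud cases exactly; defer otherwise.
from sklearn.linear_model import LogisticRegression

class MemorizingFraudModel:
    """Flag transactions identical to a known fraud case; otherwise use a fallback model."""

    def __init__(self, fallback):
        self.fallback = fallback
        self.known_fraud = set()

    def fit(self, X, y):
        # Remember every training row labeled as fraud (y == 1) verbatim.
        self.known_fraud = {tuple(row) for row, label in zip(X, y) if label == 1}
        self.fallback.fit(X, y)
        return self

    def predict(self, X):
        fallback_pred = self.fallback.predict(X)
        # Exact matches to a remembered fraud case are always flagged.
        return [1 if tuple(row) in self.known_fraud else int(p)
                for row, p in zip(X, fallback_pred)]

# Usage sketch:
# model = MemorizingFraudModel(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
```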

Haitao Du
  • 32,885
  • 17
  • 118
  • 213
  • 6
    This answer is quite interesting as it presents a use case. I think with "hard-coded to remember" @hxd1011 means that the model made sure that each of the reported fraud cases led to a "fraud flag" and that they were not smoothed or interpolated away by a, uhm, say, fitted function. Something like that, right? – IcannotFixThis May 11 '17 at 15:08
  • 1
    @IcannotFixThis yes. In fact, we tried many other ways to control false positives, but trying to overfit the fraud cases, in a crazy way, worked well. – Haitao Du May 11 '17 at 15:10
  • 3
    In your case, your evaluation function is overweighting TPR and underweighting FPR, e.g. the $F_\beta$ score with $\beta \gg 1$. (Now I know why my debit card company is so annoying, they flag any little thing, even faulty card-scanners at gas stations) – smci May 11 '17 at 23:16
  • 3
    That may be annoying, but it is thousands of times less annoying than having your finances ruined because someone nefarious got your card information. – Matthew Drury Aug 25 '17 at 14:56
14

Maybe: beware. When you say that 70% accuracy (however you measure it) is good enough for you, it feels like you're assuming that errors are randomly or evenly distributed.

But one of the ways of looking at overfitting is that it happens when a modeling technique allows (and its training process encourages) paying too much attention to quirks in the training set. Subjects in the general population that share these quirks may have highly unbalanced results.

So perhaps you end up with a model that says all red dogs have cancer -- because of that particular quirk in your training data. Or that married people between the ages of 24 and 26 are nearly guaranteed to file fraudulent insurance claims. Your 70% accuracy leaves a lot of room for pockets of subjects to be 100% wrong because your model is overfit.

(Not being overfit isn't a guarantee that you won't have pockets of wrong predictions. In fact an under-fit model will have swaths of bad predictions, but with overfitting you know you are magnifying the effect of quirks in your training data.)
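
A back-of-the-envelope sketch of the "pockets" point, with assumed numbers chosen only to show how the arithmetic can work out:

```python
# Assumed numbers: overall accuracy can look fine while one subgroup
# sharing the learned quirk is misclassified every single time.
subgroup_share = 0.10   # fraction of the population sharing the quirk
acc_in_subgroup = 0.00  # the overfit rule is always wrong for them
acc_elsewhere = 0.78    # decent accuracy for everyone else

overall = subgroup_share * acc_in_subgroup + (1 - subgroup_share) * acc_elsewhere
print(f"overall accuracy = {overall:.2f}")  # 0.70
```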

Wayne
  • 19,981
  • 4
  • 50
  • 99
  • Thanks. Do you mean that it is possible that this 70% accuracy is obtained on the quirks in the training data that are also present in the test data? Can't we judge based on the accuracy of the test data? I think the quirks in the training data that are also present in the test data should be learned during training. – Hossein May 12 '17 at 19:29
  • 1
    If I understand you, it would be the opposite: overfitting to quirks has given you your high accuracy in training. The reason you get a lower accuracy in testing is that those quirks don't apply to your overall dataset. But of course your training and testing sets -- even if you do cross-validation, which helps -- may be quirky in relation to your population. In which case your testing/validation results won't predict well how you actually perform. – Wayne May 12 '17 at 20:06
  • You are right that the testing set may be quirky in relation to the population, but this is not specific to overfitted models. All of our evaluations suffer from it, and we have no choice other than relying on the test set as a proxy for the true population. – Hossein May 13 '17 at 01:44
  • True, it's not unique to overfitted models, but it is amplified in an overfit model. I want to say _by definition_ the model is overfit because it clearly suffers from overemphasizing the quirks. – Wayne May 13 '17 at 02:20
7

No, they can be useful, but it depends on your purpose. Several things spring to mind:

  1. Cost-Sensitive Classification: If your evaluation function overweights TPR and underweights FPR, use the $F_\beta$ score with $\beta \gg 1$ (such as in @hxd1011's answer on anti-fraud); see the sketch after this list.

  2. Such a classifier can be really useful in an ensemble. We could have one classifier with normal weights, one that overweights TPR, one that overweights FNR. Then even simple rule-of-three voting, or averaging, will give better AUC than any single best classifier. If each model uses different hyperparameters (or subsampled training-sets, or model architectures), that buys the ensemble some immunity from overfitting.

  3. Similarly, for real-time anti-spam, anti-fraud or credit-scoring, it's ok and desirable to use a hierarchy of classifiers. The level-1 classifiers should evaluate really fast (ms) and it's ok for them to have a high FPR; any mistakes they make will be caught by more accurate, fully-featured, slower higher-level classifiers or ultimately human reviewers. Obvious example: prevent fake-news headlines from Twitter account takeovers, like the 2013 "White House bomb attack kills three", from affecting $billions of trading within ms of posting. It's ok for the level-1 classifier to flag that as positive for spam; we can allow it a little while to (automatically) determine the truth/falsehood of sensational-but-unverified news reports.
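
A minimal sketch of point 1 (the labels below are made up purely for illustration): with $\beta \gg 1$ the $F_\beta$ score is dominated by recall, so an aggressive classifier that catches every fraud case scores well despite its false alarms.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true       = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_aggressive = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # catches every fraud, three false alarms
y_cautious   = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # no false alarms, misses half the fraud

for name, y_pred in [("aggressive", y_aggressive), ("cautious", y_cautious)]:
    print(name,
          "precision:", precision_score(y_true, y_pred),
          "recall:", recall_score(y_true, y_pred),
          "F_10:", round(fbeta_score(y_true, y_pred, beta=10), 3))
```

Swapping in other values of $\beta$ shows the score shifting from precision-dominated ($\beta < 1$) to recall-dominated ($\beta \gg 1$).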

smci
  • 1,456
  • 1
  • 13
  • 20
2

I'm not denying that an overfitted model could still be useful. But just keep in mind that this 70% could be misleading information. What you need in order to judge whether a model is useful or not is the out-of-sample error, not the testing error (the out-of-sample error is not known, so we have to estimate it using a blinded testing set), and that 70% is hardly a good approximation of it.

In order to make sure that we're on the same page on terminology after the comment of @RichardHardy, let's define the testing error as the error obtained when applying the model to the blinded testing set, and the out-of-sample error as the error obtained when applying the model to the whole population.

How well the testing error approximates the out-of-sample error depends on two things: the model itself and the data.

  • An "optimal" model yields a (testing) accuracy that scarcely depends on the data; in this case, it would be a good approximation. "Regardless" of the data, the prediction error would be stable.

  • But an overfitted model's accuracy is highly dependent on the data (as you mentioned, 100% on the training set and 70% on the other set). So it might happen that, when the model is applied to another data set, the accuracy is somewhat lower than 70% (or higher), and we could have bad surprises. In other words, that 70% may not be telling you what you believe it is telling you (see the sketch below).
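
A minimal sketch of how variable that single number can be, with assumed values (true out-of-sample accuracy 0.65, test set of 200 cases) chosen only for illustration:

```python
# Assumed simulation: if the true out-of-sample accuracy were 0.65, how often
# would a single 200-case test set report 70% or better?
import numpy as np

rng = np.random.default_rng(0)
true_acc, n_test, n_repeats = 0.65, 200, 10_000

observed = rng.binomial(n_test, true_acc, size=n_repeats) / n_test
print("range of observed test accuracies:", observed.min(), observed.max())
print("share of test sets reporting >= 0.70:", (observed >= 0.70).mean())
```

The larger the test set, the tighter this spread, which is also @RichardHardy's point in the comments below.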

Metariat
  • 2,376
  • 4
  • 21
  • 41
  • 1
    Since the test set that obtains 70% accuracy is not seen in the training phase, isn't it a good estimate of the out-of-sample error? I think the difference between training error (100%) and testing error (70%) is not a good indication of the difference between out-of-sample error and test error. It is possible that the overfitted model performs 70% accurately in the real world, while it is 100% accurate for the training data. I expect training error to be lower than test error, since the training data are used to generate the model, but the test data are not seen during training. – Hossein May 11 '17 at 12:38
  • yes, it could perform 70% accurately in the real world, or even better, but it could also be worse. The problem with this approximation is that the 70% obtained is highly variable, and we don't know how variable it is, in contrast to the "optimal" model. – Metariat May 11 '17 at 12:42
  • I just modified my comment. Could you please take a look at it? You say that the difference between the training error and the test error is an indication of the variability of the test error, but I think the test error can be robust because it is not seen during training. – Hossein May 11 '17 at 12:44
  • 5
    I think you might have confused the *validation* set and the *test* set, and their respective errors. The *test* error **is** the *out-of-sample* error. While *validation* error is an optimistic measure of a selected model, *test* error is not. The *test* error is an unbiased estimate of how the model will perform on a new sample from the same population. We can estimate the variance of the test error, so we are quite fine by knowing only the *test* error as long as the test set is not too small. @Hossein – Richard Hardy May 11 '17 at 12:49
  • 3
    Can you elaborate on the difference between out-of-sample error and testing error? From my understanding, both are the error found when applying the model to samples not used to train the model. The only possible difference I can see is when using time-series data, the out-of-sample data should be from later time points, but this questions makes no mention of that. – Nuclear Hoagie May 11 '17 at 12:49
  • 1
    From my perspective, the testing error is the error obtained when applying the model to a blinded set; it is an approximation of the out-of-sample error, which is the error obtained when applying the model to the whole population. And they are not the same; the valuable information is the out-of-sample error. And when the model is overfitted, the testing error is not stable, and bad surprises could happen on other data sets. – Metariat May 11 '17 at 13:08
  • 4
    @Metariat, you are right that the test error is an estimate and it could be different from one test set to another. However, as I mentioned before, there is no reason to expect that the test error underestimates the true error (it does not, on average). So by taking a large-enough test sample, we can bound the test error with a desired level of confidence at a desired range. Now more practically, perhaps you should define the test error by editing your answer to make sure there is no misunderstanding of what you mean when contrasting the test error with the out-of-sample error. – Richard Hardy May 11 '17 at 13:17
  • 1
    But to be fair, when I re-read your answer for the second and the third time, it is getting clearer what you mean. My comments then are about making it even better and even clearer. – Richard Hardy May 11 '17 at 13:20