
I have an extremely simple classification problem. My dataset looks like this:

| feature | target |
|---------|--------|
| A       | 1      |
| A       | 1      |
| B       | 0      |
| A       | 0      |
| ...     | ...    |
| A       | 0      |
| B       | 0      |
| A       | 1      |

As you can see, the feature can take only two values (A and B) and the target is always either 0 or 1. My goal is to predict the probability of target = 1 given the value of the feature.

I construct a data set such that the probability of target = 1 does not depend on the feature and is equal to 0.5. I generated a data set in which the feature is equal to A 1000 times and to B also 1000 times.

Just by chance, for feature = A the target 1 is observed 515 times out of 1000, and for feature = B it is observed 482 times out of 1000.
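For concreteness, here is a minimal sketch (in Python) of how such a data set can be generated; the exact counts, like 515 and 482, of course depend on the random seed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 1000 rows with feature A and 1000 with feature B;
# the target is Bernoulli(0.5) independently of the feature.
df = pd.DataFrame({
    "feature": ["A"] * 1000 + ["B"] * 1000,
    "target": rng.integers(0, 2, size=2000),
})

# observed relative frequency of target = 1 per feature value
print(df.groupby("feature")["target"].mean())
```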

I have two alternative models. The first one states that the probability of target = 1 does not depend on the value of the feature (this model is correct by construction). The second one states that the probability of target = 1 depends on the value of the feature (this model is an overfit by construction).

Now assume that I run a standard leave-one-out cross-validation to find out whether the second model is an overfit or not. When I take one observation with feature = A out, the number of 1s for feature = A will be either 515 or 514 and, therefore, the predicted probability for A will be either 515/999 or 514/999, which is very close to the in-sample probability (515/1000). So the second model will be better than the first model not only in-sample but also out-of-sample (as estimated by leave-one-out cross-validation)!

So it means that we were not able to detect the overfit with the leave-one-out procedure, and the second model, which has lower predictive power, beats the first model, which has better predictive power.
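Here is a minimal sketch of that leave-one-out comparison, using log loss as one possible scoring rule (the argument does not depend on this particular choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": ["A"] * 1000 + ["B"] * 1000,
                   "target": rng.integers(0, 2, size=2000)})

def loo_log_loss(data, by_feature):
    """Mean leave-one-out log loss; probabilities are re-estimated on the remaining 1999 rows."""
    losses = []
    for i in range(len(data)):
        train = data.drop(data.index[i])
        row = data.iloc[i]
        if by_feature:  # model 2: a separate probability for each feature value
            p = train.loc[train["feature"] == row["feature"], "target"].mean()
        else:           # model 1: one overall probability
            p = train["target"].mean()
        p = min(max(p, 1e-12), 1 - 1e-12)
        losses.append(-(row["target"] * np.log(p) + (1 - row["target"]) * np.log(1 - p)))
    return float(np.mean(losses))

print("constant model:", loo_log_loss(df, by_feature=False))
print("feature model :", loo_log_loss(df, by_feature=True))
```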

Is this a known problem? How does one deal with it? Is there a procedure that is free of errors like this?

I understand that one could run a test that checks whether there is a statistically significant difference between the probabilities of 1 for different values of the feature, but my question is about cross-validation and leave-one-out specifically. For more complex data we do not have such a test; we run cross-validation there, and I want to be sure that in those more complex cases cross-validation does not fool me the way it does in the simple case described above.
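For reference, such a test on the counts above could look like the following (a chi-squared test of independence on the 2×2 table; any standard two-proportion test would do):

```python
from scipy.stats import chi2_contingency

# rows: feature A, feature B; columns: target = 1, target = 0
table = [[515, 485],
         [482, 518]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)  # well above 0.05: no evidence that the probability depends on the feature
```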

Roman (edited by Mayeul sgc)
  • For all you can observe, *there is no overfitting*. You can only tell your second model is overfitting because you know the true data generating process. And you can't expect LOOCV (or *any* other process) to detect overfitting based on information beyond observed data (and priors, if your analysis is Bayesian). – Stephan Kolassa Jan 03 '22 at 09:42
  • @StephanKolassa, I guess we can tell that the second model is overfitting NOT only because we know that it does by construction. I guess that in the described situation the data itself can tell us that the second model is an overfit. Namely, 515 out of 1000 and 482 out of 1000 indicate that we do not have enough evidence to assume that the probability depends on the feature. So, just from the data we can conclude that it is an overfit, and the fact that leave-one-out fails to do so indicates that leave-one-out has some drawbacks. Am I wrong? – Roman Jan 03 '22 at 09:47
  • Well, if you want to bring in NHST, then yes. But your last paragraph explicitly states that you want to look at LOOCV, not NHST. And if you do decide to test for significance and use the null model if the test comes out insignificant, then that is a Bayesian approach, where your null model is your prior. – Stephan Kolassa Jan 03 '22 at 09:50
  • This is not specific to LOOCV; k-fold CV would "fail" here as well, not because it has "drawbacks", but because this is simply the best guess based on the data. If you do not know the true underlying process generating the data, you cannot conclude that the second model is an overfit. – user2974951 Jan 03 '22 at 10:14
  • You could, however, find out whether the second model is an overfit by evaluating its predictive performance on another test set (if you have one), and there you would be able to determine whether your model is worse than some baseline. – user2974951 Jan 03 '22 at 10:15
  • @user2974951, but if I can detect an overfit by using another data set, doesn't it mean that I can detect an overfit by doing k-fold with k=2? – Roman Jan 03 '22 at 10:18
  • If the same happened and you knew the true underlying model would produce 1 for feature=A with probability 0.52 and for feature=B with probability 0.49, which of course based on the data is at least as realistic as your uninformative true model, what would you think? – Christian Hennig Jan 03 '22 at 12:28
  • @ChristianHennig, I guess my prior knowledge is not essential here. We have a sequence of models of increasing complexity (let's say: constant, linear, quadratic, cubic, ...). Then we should not accept a model with a higher complexity if the observed data could easily be generated under the assumption of a model with lower complexity. For example, we should not accept the linear model if the observed data could be obtained from a distribution that does not depend on the features. I thought that LOOCV is used exactly for this purpose (to select the model of proper complexity), but it seems to fail. – Roman Jan 03 '22 at 12:39
  • @Roman My point is that the data do not allow to distinguish between a situation in which your random model with no influence of the feature is correct and another one in which the model with flexible/estimated parameter is true and not the simple one. Note that a significance test cannot identify that either (not rejecting the H0 doesn't mean it's true). So you expect LOO-CV to do something here that is in fact impossible, unless you use a procedure that is barred from choosing the more complex model for reasons other than what the data show. Also see my answer. – Christian Hennig Jan 03 '22 at 12:54
  • It's a great question though! (+1) – Christian Hennig Jan 03 '22 at 12:57

4 Answers


The problem here is that your dataset is a bit of an outlier in the population of datasets from the stated data generating process, in that the feature is more correlated with the target than average.

If the feature and target were randomly generated with equal probabilities, then whether the feature "matches" the target (A paired with 1, or B paired with 0) is a Bernoulli trial with success probability 0.5. In this case, out of 2000 samples, there are 1033 = 515 + (1000 - 482) "successes", and the probability of 1033 or more "successes" in 2000 trials is only about 0.073.
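This tail probability can be checked directly from the binomial distribution:

```python
from scipy.stats import binom

# P(at least 1033 "successes" in 2000 fair Bernoulli trials)
print(binom.sf(1032, n=2000, p=0.5))  # about 0.073
```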

Now the statistical distribution of the "test set" in each fold of the leave-one-out procedure is the same as that of the "training set", so the test data are also an unlikely sample from the true data generating mechanism and cannot be expected to give the "right" answer.

There is nothing wrong with leave-one-out cross-validation; the problem lies in being unlucky with the sample of data you have obtained.

One thing you might want to do is to compute the Bayes factor comparing the two hypotheses. I suspect it would tell you that the evidence is not strongly in favour of either hypothesis, which is reasonable, as the difference amounts to only 33 of the 2000 observations.
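For instance, a minimal sketch of such a Bayes factor, assuming uniform Beta(1, 1) priors on the success probabilities (one shared probability under the first model, separate probabilities per feature value under the second):

```python
import numpy as np
from scipy.special import betaln

# observed counts of target = 1 out of 1000 for each feature value
k_a, n_a = 515, 1000
k_b, n_b = 482, 1000

# log marginal likelihoods with Beta(1, 1) priors:
# integral of p^k (1 - p)^(n - k) dp = B(k + 1, n - k + 1)
log_m1 = betaln(k_a + k_b + 1, (n_a - k_a) + (n_b - k_b) + 1)             # shared p
log_m2 = betaln(k_a + 1, n_a - k_a + 1) + betaln(k_b + 1, n_b - k_b + 1)  # separate p_A, p_B

# BF > 1 means the data favour the simpler (shared-probability) model
print("Bayes factor (simple vs. complex):", np.exp(log_m1 - log_m2))
```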

Another way of looking at this would be to use NHSTs and consider the power of the test, which would probably be rather low. If you took either model as H0 in the test, you would be unable to reject it, which indicates that there isn't enough data to be confident of a difference in performance between the models. Essentially you need a lot of data to be able to be "confident" that a very small effect (such as this one) is not a random artefact.
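A rough sketch of such a power calculation, treating the observed proportions (0.515 and 0.482) as if they were the true ones and using a two-proportion z-test:

```python
import numpy as np
from scipy.stats import norm

p1, p2, n = 0.515, 0.482, 1000           # assumed "true" proportions, group sizes
diff = p1 - p2
se = np.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
z_crit = norm.ppf(0.975)                  # two-sided test at alpha = 0.05

# approximate power to detect a true difference of this size
power = norm.sf(z_crit - diff / se) + norm.cdf(-z_crit - diff / se)
print(power)  # roughly 0.3, i.e. rather low
```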

The key point is that cross-validation can provide evidence of over-fitting, but it does not by itself provide a reliable indicator of over-fitting. You need to consider the uncertainties involved.

Dikran Marsupial
  • 7% is not such a big outlier. I tried only 3 times to get this data set. – Roman Jan 03 '22 at 11:28
  • To make my point I do not even need such an "extreme" case. Let's consider the situation where I got 503 and 497 ones for A and B respectively. It can easily happen. And for this data set LOOCV with the second model (which assumes a dependency on the feature) will give a better result than the correct model. – Roman Jan 03 '22 at 11:30
  • @Roman I think you are missing the point. A small dataset with a small effect is difficult to distinguish from a small dataset with no effect, especially if you select a dataset that is a (slight) outlier. It isn't a pass/fail test for over-fitting, it gives you an *estimate* of the overfitting. If the dataset is too small, that estimate will be less reliable. – Dikran Marsupial Jan 03 '22 at 11:33
  • Note you have computed a probability of 515/1000 for A; what is the uncertainty on that probability? The confidence interval is about 0.48 to 0.56 (according to a binomial CI calculator I found online). So the model is telling you that there may be no connection between the feature and the target. – Dikran Marsupial Jan 03 '22 at 11:40
  • I know that in my case I do not have enough data to believe that the probability depends on the feature. I can also prove it either with the confidence interval, as you suggested, or with another statistical test. My problem is with the fact that LOOCV in most cases will prefer an overfitted model to a model with the proper complexity, and I would like to know what to do about it. For example, I will fit models of increasing complexity (constant, linear, quadratic, cubic) and will evaluate them with LOOCV. What if LOOCV shows me that the best results are obtained with the quadratic function? – Roman Jan 03 '22 at 12:03
  • @Roman as I said, CV can provide evidence of over-fitting, but nothing more than that. No method can provide proof of overfitting other than asymptotically. Note that this particular example, even though it is almost pathological for LOOCV, does still provide some evidence of overfitting, as the LOOCV probabilities *are* slightly more moderate than those obtained on the full dataset. It is such a small degree of over-fitting that you can't expect to reject it reliably with so little data. – Dikran Marsupial Jan 03 '22 at 12:08
  • BTW it could just as easily be the other way round. Say the true probability was 515/1000 for A. By bad luck, you could collect a dataset that had a probability of 500/1000, in which case CV would select a constant model that was too simple. This would also be a case of overfitting the model selection criterion. See my paper (with Mrs Marsupial) https://www.jmlr.org/papers/volume11/cawley10a/cawley10a.pdf – Dikran Marsupial Jan 03 '22 at 12:30

Your example is one of purely observation-driven modeling. And your observations indicate that there is a relationship between your feature and your target. Yes, it's a weak association. But that it is overfitting is not evident from the data you have observed! You can't very well accuse LOOCV of failing to detect overfitting that you only know is there because you know the true data generating process (DGP), i.e., you have information that is not available to the LOOCV method.

Also, "overfitting" is not a Boolean attribute. On the one hand, we never know the true DGP outside simulations, so we can never truly say there is overfitting: anything could conceivably have an influence on our outcome. On the other hand, if you believe in "tapering effect sizes", we will never be able to capture all influences, so we will always have underfitting.

Thus, it makes more sense to think of "overfitting" as a continuum. How much worse does adding a predictor make my model for future expected losses (which we again will only be able to estimate, unless we know the true DGP)? Thus, we have to think about how much signal there is in our data. In the present case, the overfitting is quite weak, and as Dikran says, it is hard to distinguish a weak effect from no effect whatsoever. And adding this feature will only have a very small effect on future predictions, so the overfitting, measured on a continuum, is small.

Per above, you are modeling purely based on the observations here. Such a model has no predilection towards a simpler model, like the one without an effect of the feature. There are various ways of including such a predilection: essentially, we would bias our models towards simplicity, and per the bias-variance tradeoff, this may very well improve future predictions. In the terminology above, our overfitting, if present, would be weaker.

  • We could explicitly run a Bayesian model with a prior on the impact of the feature.
  • As you write, we could include elements of NHST, only accepting the more complex model if the improvement in fit is statistically significant.
  • Or we could use the "one standard error rule", which is very often used in cross-validation (see the sketch below).
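A minimal sketch of the one-standard-error rule, with made-up per-fold CV errors just to show the selection logic: take the simplest model whose mean CV error is within one standard error of the best mean CV error.

```python
import numpy as np

# hypothetical per-fold CV errors, models ordered from simplest to most complex
cv_errors = {
    "constant": np.array([0.70, 0.69, 0.71, 0.68, 0.70]),
    "feature":  np.array([0.69, 0.68, 0.71, 0.69, 0.70]),
}

means = {m: e.mean() for m, e in cv_errors.items()}
best = min(means, key=means.get)
# mean CV error of the best model plus one standard error of its estimate
threshold = means[best] + cv_errors[best].std(ddof=1) / np.sqrt(len(cv_errors[best]))

# walk from simplest to most complex, take the first model within the threshold
chosen = next(m for m in cv_errors if means[m] <= threshold)
print(chosen)  # with these made-up numbers, the simpler "constant" model is selected
```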
Stephan Kolassa

What happens here is that you compare one model that you know to be the true one with fixed true parameters to another model that will estimate these parameters. Obviously if you fix the parameters at the true values, you will normally be better on new data than if you estimate the parameters. But with a certain probability, any model selection method will select the model with estimated parameters, so will appear to be overfitting. This is not a specific problem with LOO-CV but applies to any model selection approach. If you compare a simpler model that you know to be true with a more complex model, the more complex model can be selected with a probability that is not negligible, in which case overfitting takes place.

If you set up the problem so that underfitting cannot happen, you will overfit with a certain probability and underfit with probability zero, so on average overfitting will happen. Only if you also look at what happens in case the more complex model is true do you get a fair assessment of the model selection procedure.

Note furthermore that LOO-CV is known to be asymptotically equivalent to the AIC, which will overfit even asymptotically; see https://www.jstor.org/stable/2984877. (Note that I haven't made an effort to check whether potential assumptions in that paper apply here.)

The AIC (and equivalently LOO-CV) is however asymptotically good at finding an optimal prediction rule (for which overfitting is not as bad as underfitting). Note that in the given setup, if you just try to predict the target 1 or 0, rather than predicting their probabilities, any rule will be wrong 50% of the time, so that the "overfitted" model will not be worse than the true one when it comes to point prediction.

Christian Hennig
  • I believe that my prior knowledge is not essential to my arguments. In real-life settings I will have a sequence of models ordered by their complexity. The first model in this sequence is the "constant model", which states that there is no dependence on the features. I plan to use LOOCV to find the model of the proper complexity. In other words, I should not accept a model if the given data could easily be generated under the assumption of a less complex model. In the given example, I should not accept the feature-dependent model if what I observe could easily occur even when there is no dependence on the features. – Roman Jan 03 '22 at 13:36
  • For example, if I solve a regression problem with a constant model, a linear model, a quadratic model, and a cubic model, then I would use LOOCV to choose the proper complexity. But now I am not sure anymore, since it looks to me that LOOCV would tend to choose too complex a model. – Roman Jan 03 '22 at 13:38
  • @Roman See, I'm not trying to convince you that LOOCV is fine for you. LOOCV is a method to choose a prediction rule, and prediction rules based on too complex models are usually better than prediction rules based on too simple models, so LOOCV has a certain tendency to prefer an overcomplex model to a simpler one (this, for example, does not hold for BIC, which may give you something closer to what you apparently want). However, as I wrote, a "test" of LOOCV in which *only* overfitting can happen but not underfitting will not give you a fair picture. – Christian Hennig Jan 03 '22 at 14:24
  • @Roman The issue here is *not* your prior knowledge per se, rather that you design an experiment in which overfitting can happen but underfitting cannot happen. – Christian Hennig Jan 03 '22 at 14:26
  • " I should not accept a model, if the given data could easily be generated under assumption of less complex model. " *is* expressing a prior belief about the models. "But now I am not sure anymore, since it looks to me that LOOCV would tend to choose too complex model." no, it will chose a more complex model where random sampling of the data gives *this* sort of (near) outlier. There will be other examples where the sample will be a (near) outlier in the other direction. It is almost unbiased, but it has a non-negligible variance which doesn't favour more complex model a-priori. – Dikran Marsupial Jan 03 '22 at 14:32
  • @DikranMarsupial I suspect, following Roman's argument, that LOOCV will prefer the more complex model whenever one result has an even slightly higher relative frequency of 1, say, in the sample (for the given model). I'm with Roman in thinking that this is not an outlying situation at all. – Christian Hennig Jan 03 '22 at 16:41
  • @ChristianHennig yes, I mean more that in situations where there is a genuine effect there will be outlier samples where the constant model is preferred. It is the whole dataset that is the (mild) outlier, rather than the behaviour of CV. The likelihood of an *apparent* effect of this magnitude or greater is only about 0.07. Having said that, the apparent effect is also only very small. – Dikran Marsupial Jan 03 '22 at 16:47
  • The problem is one of over-fitting the model selection criterion (the LOOCV error is a random variable as it is a statistic evaluated on a finite dataset). In my paper (with Mrs Marsupial), https://jmlr.org/papers/volume11/cawley10a/cawley10a.pdf, I give examples where over-fitting the CV model selection criterion causes underfitting as well as over-fitting the training data (see figures 5 and 6) but in a more complex/realistic setting. – Dikran Marsupial Jan 03 '22 at 16:51

Overfitting means: you get better results on your training set than on your test set.

  1. There is no clear definition of a train/test set here.

  2. Let's assume we divide your uniformly distributed dataset into a training set of 700 data points and a test set of 300. Because it is uniformly distributed, you can achieve an accuracy of ~0.5 on your training set. Now you test your model on your test set. If the performance is about the same (we expect ~0.5 accuracy), this means there is no overfit in your model (see the sketch after this list).

  3. Regardless of the overfit question, an accuracy of ~0.5 means that your model cannot predict very well (on any set).
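A small sketch of the check in point 2, with freshly generated data and a majority-class prediction per feature value (the 70/30 split is just the example from the list above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"feature": ["A"] * 1000 + ["B"] * 1000,
                   "target": rng.integers(0, 2, size=2000)})

# 70/30 train/test split
train = df.sample(frac=0.7, random_state=1)
test = df.drop(train.index)

# model 2: predict the majority class per feature value, learned on the training set
majority = train.groupby("feature")["target"].mean().round().astype(int)

print("train accuracy:", (train["feature"].map(majority) == train["target"]).mean())
print("test accuracy :", (test["feature"].map(majority) == test["target"]).mean())
# both should be close to 0.5
```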

Jonathan
  • I would say that overfitting means that you reduce the loss on the training data (the "fit" of the model to that data) at the expense of a reduction in generalisation performance. The test set loss is only an *estimate* of generalisation performance, not generalisation performance itself. – Dikran Marsupial Jan 03 '22 at 11:10
  • @DikranMarsupial I agree, just trying to simplify things for this specific example – Jonathan Jan 03 '22 at 11:12
  • It is very important in this question, because the test loss is not a good estimate of the generalisation performance in this specific case, as it is an unusual sample from the data generating process. – Dikran Marsupial Jan 03 '22 at 11:13
  • The 0.5 probability is not essential to my argument. I could choose 0.8, and then for A I might get 807 ones out of 1000 and for B 791 out of 1000, and LOOCV would say that the second model (assuming a dependency on the feature) is better than the correct model. – Roman Jan 03 '22 at 11:33