
Is it true that Bayesian methods don't overfit? (I saw some papers and tutorials making this claim)

For example, if we apply a Gaussian Process to MNIST (handwritten digit classification), but only show it a single sample, will it revert to the prior distribution for any inputs different from that single sample, however small the difference?

MWB
  • Was just thinking: is there a mathematically precise way you can define "overfitting"? If you can, it is likely you can also build features into a likelihood function or a prior to avoid it happening. My thinking is that this notion sounds similar to "outliers". – probabilityislogic May 05 '19 at 13:49

2 Answers


No, it is not true. Bayesian methods will certainly overfit the data. There are a couple of things that make Bayesian methods more robust against overfitting, but you can make them more fragile as well.

The combinatoric nature of Bayesian hypotheses, rather than the binary hypotheses of null hypothesis methods, allows for multiple comparisons when someone lacks the "true" model. A Bayesian posterior effectively penalizes an increase in model structure, such as adding variables, while rewarding improvements in fit. The penalties and gains are not optimizations, as they would be in non-Bayesian methods, but shifts in probabilities from new information.
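To make that concrete, here is a minimal sketch (my own toy setup; the priors, noise scale, and grid bounds are arbitrary illustrative choices) of how the marginal likelihood penalizes a spurious extra variable even though the bigger model fits at least as well:

```python
# Toy "Bayesian Occam's razor": the marginal likelihood (evidence) of a model
# with a spurious extra regressor is typically lower than that of the true,
# smaller model, even though the bigger model's best fit is at least as good.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n = 20
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)             # irrelevant regressor
y = 1.5 * x1 + rng.normal(size=n)   # data generated from x1 alone

grid = np.linspace(-4, 4, 201)      # integration grid for the N(0,1) priors
da = grid[1] - grid[0]

# Model 1: y = a*x1 + eps;  log p(y|M1) = log of integral of p(y|a) p(a) da
ll1 = np.array([norm.logpdf(y, a * x1, 1).sum() for a in grid])
log_ev1 = logsumexp(ll1 + norm.logpdf(grid)) + np.log(da)

# Model 2: y = a*x1 + b*x2 + eps; integrate over both parameters
ll2 = np.array([[norm.logpdf(y, a * x1 + b * x2, 1).sum() for b in grid]
                for a in grid])
log_prior2 = norm.logpdf(grid)[:, None] + norm.logpdf(grid)[None, :]
log_ev2 = logsumexp(ll2 + log_prior2) + 2 * np.log(da)

print(f"log evidence, one variable:   {log_ev1:.2f}")
print(f"log evidence, extra variable: {log_ev2:.2f}")  # typically lower
```

The extra parameter spreads prior mass over fits the data does not support, so the evidence shifts against the larger model even though its maximized likelihood is at least as high.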

While this generally gives a more robust methodology, there is an important constraint: the use of proper prior distributions. While there is a tendency to want to mimic Frequentist methods by using flat priors, this does not assure a proper solution. There are articles on overfitting in Bayesian methods, and it appears to me that the sin is in trying to be "fair" to non-Bayesian methods by starting with strictly flat priors. The difficulty is that the prior is important in normalizing the likelihood.
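A small sketch of what the flat prior costs on a small sample: under a flat prior the MAP estimate is just least squares, which interpolates the noise, while a proper Gaussian prior shrinks the fit (the polynomial degree, noise level, and prior scale below are arbitrary illustrative choices):

```python
# Degree-9 polynomial fit to 10 noisy points: MAP under an (improper) flat
# prior is ordinary least squares; MAP under a proper N(0, tau^2) prior is
# ridge regression with penalty sigma^2 / tau^2.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)   # noise sd = 0.2
X = np.vander(x, 10)                                    # degree-9 features

flat_map = np.linalg.lstsq(X, y, rcond=None)[0]         # flat prior: interpolates
lam = 0.04 / 1.0                                        # sigma^2 / tau^2 with tau = 1
proper_map = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

X_test = np.vander(np.linspace(0, 1, 200), 10)
truth = np.sin(2 * np.pi * np.linspace(0, 1, 200))
for name, w in [("flat prior", flat_map), ("proper prior", proper_map)]:
    rmse = np.sqrt(np.mean((X_test @ w - truth) ** 2))
    print(f"{name:13s} RMSE against the true curve: {rmse:.3f}")
```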

Bayesian models are intrinsically optimal models in Wald's sense of admissibility, but there is a hidden bogeyman in there. Wald assumes the prior is your true prior and not some prior you are using so that editors won't ding you for putting too much information in it. They are not optimal in the same sense that Frequentist models are. Frequentist methods begin with the optimization of minimizing the variance while remaining unbiased.

This is a costly optimization in that it discards information and is not intrinsically admissible in the Wald sense, though it frequently is admissible. So Frequentist models provide an optimal fit to the data, given unbiasedness. Bayesian models are neither unbiased nor optimal fits to the data. This is the trade you are making to minimize overfitting.

Bayesian estimators are intrinsically biased estimators (unless special steps are taken to make them unbiased) and are usually a worse fit to the data. Their virtue is that they never use less information than an alternative method to find the "true model," and this additional information makes Bayesian estimators never more risky than alternative methods, particularly when working out of sample. That said, there will always exist a sample that could have been randomly drawn that would systematically "deceive" the Bayesian method.
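A quick simulation makes the bias-versus-risk trade explicit; here a normal mean is estimated with an informative prior that happens to be roughly right (the numbers are illustrative, not canonical):

```python
# The posterior mean under a N(0,1) prior is biased toward zero, yet it has
# lower mean squared error (risk) than the unbiased sample mean when the
# true parameter really does sit near the prior.
import numpy as np

rng = np.random.default_rng(2)
theta, n, sigma = 0.5, 5, 1.0                 # truth close to the prior
data = rng.normal(theta, sigma, size=(100_000, n))
xbar = data.mean(axis=1)                      # unbiased estimator
bayes = (n / (n + sigma**2)) * xbar           # posterior mean, N(0,1) prior

for name, est in [("sample mean", xbar), ("posterior mean", bayes)]:
    bias = est.mean() - theta
    mse = np.mean((est - theta) ** 2)
    print(f"{name:15s} bias = {bias:+.3f}   MSE = {mse:.4f}")
```

Pick a true value far from the prior and the comparison flips, which is the "deceiving sample" caveat above in estimator form.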

As to the second part of your question: if you were to analyze a single sample, the posterior would be forever altered in all its parts and would not revert to the prior unless there was a second sample that exactly cancelled out all the information in the first. At least theoretically, this is true. In practice, if the prior is sufficiently informative and the observation sufficiently uninformative, the impact can be so small that a computer cannot measure the difference because of the limit on the number of significant digits; the effect is real, but too small to register as a change in the posterior.
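A deliberately extreme Beta-Binomial example of that last point:

```python
# Analytically, one observation always moves the posterior; numerically,
# the move can fall below double-precision resolution.
a = b = 1e17                          # extremely informative Beta(a, b) prior
prior_mean = a / (a + b)              # 0.5
post_mean = (a + 1) / (a + b + 1)     # after one success: 0.5 + ~2.5e-18
print(post_mean - prior_mean)         # prints 0.0: the computer cannot see it
```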

So the answer is "yes," you can overfit a sample using a Bayesian method, particularly if you have a small sample size and improper priors. The second answer is "no," Bayes' theorem never forgets the impact of prior data, though the effect can be so small that you miss it computationally.

Dave Harris
  • In *They begin with the optimization of minimizing the variance while remaining unbiased.*, what is *They*? – Richard Hardy Mar 04 '17 at 09:10
  • Only a very few models (essentially a set with measure zero) permit the formation of unbiased estimators. For example, in a normal $N(\theta, \sigma^2)$ model, there is no unbiased estimator of $\sigma$. Indeed, most times we maximize a likelihood, we end up with a biased estimator. – Andrew M Sep 30 '17 at 15:03
  • @AndrewM: There *is* an unbiased estimator of $\sigma$ in a normal model - https://stats.stackexchange.com/a/251128/17230. – Scortchi - Reinstate Monica Apr 25 '18 at 13:15
  • @nbro No, I do not. I have not worked in neural networks in so many years that little I would say would be trustworthy. – Dave Harris Apr 14 '20 at 20:45
  • *Bayesian models are intrinsically biased models*: do you perhaps mean estimators rather than models? (I am thinking about model bias in terms of bias-variance trade-off.) Also, *Bayesian models never less risky than alternative models*? Not exactly sure what you mean by risky, but is it perhaps the opposite? I found in another answer of yours that *all Bayesian estimators <...> are intrinsically the least risky way to calculate an estimator.* – Richard Hardy Apr 05 '21 at 04:11
  • @RichardHardy thanks for the catch. Fixed it. – Dave Harris Apr 05 '21 at 15:11
  • "The combinatoric nature of Bayesian hypotheses, rather than binary hypotheses allows for multiple comparisons when someone lacks the "true" model for null hypothesis methods." How can a hypothesis be Bayesian? Being "Bayesian" isn't about the kind of hypotheses one is interested in, or is it? Also, how is it less of a problem for a Bayesian if the assumed model is not true? – Christian Hennig Apr 05 '21 at 15:17
  • @Lewian this should probably be a question in itself. Consider the Frequentist hypothesis $\theta\ge{5}$. The hypothesis could be read as "it is a statement of fact that $\theta\ge{5}$." Now consider a subjective Bayesian where $\theta\sim\mathcal{N}(\mu,\sigma^2)$ with the same hypothesis. It could be read as "how often is $\theta\ge{5}$, or with what probability is that the case?" So, for starters, the type of question isn't the same. The second issue, the combinatoric nature, allows the mixing and matching of variable combinations. (continued) – Dave Harris Apr 06 '21 at 14:26
  • @Lewian The Frequentist null hypothesis is by force of math assumed to be true. Only one model form is really open for discussion. There will be a Bayesian hypothesis that matches the Frequentist null and alternative hypotheses, but there will be others as well, or at least there can be. As such, it is easier to capture issues such as misspecification. That permits models to be a bit more robust. Note that if the Frequentist in this example lacks the true model and the true model is not a subset of the combinations, then the Bayesian lacks it as well. (continued) – Dave Harris Apr 06 '21 at 14:31
  • @Lewian the difference between the methods, however, is that the null is asserted to be the truth. One could not calculate a p-value otherwise. The assertion is consequential. See Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6, 291-298. – Dave Harris Apr 06 '21 at 14:34
  • @Lewian however, to be fair in terms of overfitting, tools such as the AIC, BIC and other information criteria reduce the risk of overfitting if you use a selection process such as step-wise regression or other alternatives. – Dave Harris Apr 06 '21 at 14:35
  • Thanks for your efforts. Chances are we shouldn't have a discussion regarding foundations here. Anyway, I agree that "the type of question isn't the same" (although this can differ between different varieties of Bayesians as well). Regarding robustness against misspecification, my impression is that there is more about this in the frequentist than in the Bayesian literature, or at least as much, beginning from the work of Huber, Hampel, Tukey in the sixties, although it is true that this is often ignored in practice. (There's more to frequentism than testing parametric point hypotheses.) – Christian Hennig Apr 06 '21 at 14:42

Something to be aware of is that, as practically everywhere else, a significant problem in Bayesian methods can be model misspecification.

This is an obvious point, but I thought I'd still share a story.

A vignette from back in undergrad...

A classic application of Bayesian particle filtering is to track the location of a robot as it moves around a room. Movement expands uncertainty while sensor readings reduce uncertainty.

I remember coding up some routines to do this. I wrote out a sensible, theoretically motivated model for the likelihood of observing various sonar readings given the true values. Everything was precisely derived and coded beautifully. Then I go to test it...

What happened? Total failure! Why? My particle filter rapidly thought that the sensor readings had eliminated almost all uncertainty. My point cloud collapsed to a point, but my robot wasn't necessarily at that point!

Basically, my likelihood function was bad; my sensor readings weren't as informative as I thought they were. I was overfitting. A solution? I mixed in a ton more Gaussian noise (in a rather ad-hoc fashion), the point cloud ceased to collapse, and then the filtering worked rather beautifully.
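In hindsight the failure mode is easy to reproduce in a toy 1-D version (this is a from-scratch illustrative sketch, not my original robot code, and all the noise scales are made up):

```python
# 1-D particle filter tracking a robot that drifts right at 0.1 per step.
# If the assumed sensor sd is far smaller than the true sd, the importance
# weights become spiky, the cloud collapses, and the estimate is untrustworthy.
import numpy as np

rng = np.random.default_rng(3)

def run_filter(assumed_sd, true_sd=1.0, steps=50, n=1000):
    x_true = 0.0
    particles = rng.normal(0.0, 5.0, n)               # prior over position
    for _ in range(steps):
        x_true += 0.1                                  # robot moves
        particles += 0.1 + rng.normal(0, 0.05, n)      # motion model
        z = x_true + rng.normal(0, true_sd)            # noisy sonar reading
        w = np.exp(-0.5 * ((z - particles) / assumed_sd) ** 2) + 1e-300
        particles = rng.choice(particles, n, p=w / w.sum())   # resample
    return abs(particles.mean() - x_true), particles.std()

for sd in (0.05, 1.0):                # overconfident vs honest sensor model
    err, spread = run_filter(sd)
    print(f"assumed sd = {sd:4.2f}: error = {err:.2f}, cloud spread = {spread:.3f}")
```

With the overconfident likelihood, the cloud's spread collapses to nearly nothing while the point estimate can sit well away from the truth; with the honest noise scale, the filter behaves.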

Moral?

As Box famously said, "all models are wrong, but some are useful." Almost certainly, you won't have the true likelihood function, and if it's sufficiently off, your Bayesian method may go horribly awry and overfit.

Adding a prior doesn't magically solve problems stemming from assuming observations are IID when they aren't, assuming the likelihood has more curvature than it really does, etc.
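Here is a hypothetical Beta-Binomial sketch of the IID point: counting each of ten coin flips ten times (treating the copies as independent) leaves the posterior mean roughly where it was, but shrinks the posterior standard deviation by nearly a factor of three, manufacturing confidence out of nothing:

```python
# Treating correlated (here: duplicated) data as IID makes the posterior
# overconfident: the mean barely moves, but the spread shrinks ~ sqrt(k).
import numpy as np

heads, n = 6, 10                  # ten actual coin flips
a0 = b0 = 1.0                     # Beta(1, 1) prior
for k in (1, 10):                 # k = times each flip is counted
    a, b = a0 + k * heads, b0 + k * (n - heads)
    sd = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    print(f"counted {k:2d}x: posterior mean = {a/(a+b):.3f}, sd = {sd:.3f}")
```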

Matthew Gunn
  • "A vignette from back in undergrad...A classic application of Bayesian particle filtering is to track the location of a robot as it moves around a room"...whoa, where was your undergrad? :) – Cliff AB Apr 25 '17 at 16:41
  • @CliffAB, [Bachelor of Arts and Sciences in Economics and Symbolic Systems at Stanford](http://www.mattgunn.com/matthew_daniel_gunn_cv.pdf). – Richard Hardy Apr 05 '21 at 03:30