
In this blog post by Andrew Gelman, there is the following passage:

> The Bayesian models of 50 years ago seem hopelessly simple (except, of course, for simple problems), and I expect the Bayesian models of today will seem hopelessly simple, 50 years hence. (Just for a simple example: we should probably be routinely using t instead of normal errors just about everywhere, but we don't yet do so, out of familiarity, habit, and mathematical convenience. These may be good reasons–in science as in politics, conservatism has many good arguments in its favor–but I think that ultimately as we become comfortable with more complicated models, we'll move in that direction.)

Why should we "routinely be using t instead of normal errors just about everywhere"?


2 Answers


Because assuming normal errors is effectively the same as assuming that large errors do not occur! The normal distribution has such light tails that errors beyond $\pm 3$ standard deviations have very low probability, and errors beyond $\pm 6$ standard deviations are effectively impossible. In practice, that assumption is seldom true. When analyzing small, tidy datasets from well-designed experiments, this might not matter much, provided we do a good analysis of residuals. With data of lesser quality, it can matter much more.
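To put rough numbers on this (a minimal sketch in Python, assuming scipy is available; the $t_3$ comparison is chosen purely for illustration), compare the two-sided tail probabilities beyond 3 and 6 scale units:

```python
# Probability of an error beyond 3 or 6 units of the scale parameter,
# for the normal versus a t-distribution with 3 degrees of freedom.
from scipy import stats

for k in (3, 6):
    p_normal = 2 * stats.norm.sf(k)   # ~2.7e-3 at 3, ~2.0e-9 at 6
    p_t3 = 2 * stats.t.sf(k, df=3)    # ~5.8e-2 at 3, ~9.2e-3 at 6
    print(f"beyond ±{k}: normal {p_normal:.1e}, t(3) {p_t3:.1e}")
```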

When using likelihood-based (or Bayesian) methods, the effect of this normality assumption (as said above, effectively the "no large errors" assumption!) is to make the inference non-robust: the results of the analysis are too heavily influenced by the large errors. This must be so, since assuming "no large errors" forces our methods to interpret the large errors as small errors, and that can only happen by moving the mean-value parameter to make all the errors smaller. One way to avoid that is to use so-called "robust methods"; see http://web.archive.org/web/20160611192739/http://www.stats.ox.ac.uk/pub/StatMeth/Robust.pdf

But Andrew Gelman will not go for this, since robust methods are usually presented in a highly non-Bayesian way. Using $t$-distributed errors in likelihood/Bayesian models is a different way to obtain robust methods, as the $t$-distribution has heavier tails than the normal and so allows for a larger proportion of large errors. The degrees-of-freedom parameter should be fixed in advance, not estimated from the data, since such estimation will destroy the robustness properties of the method (*) (it is also a very difficult problem: the likelihood function for $\nu$, the number of degrees of freedom, can be unbounded, leading to very inefficient (even inconsistent) estimators).

If, for instance, you think (are afraid) that as many as 1 in 10 observations might be "large errors" (beyond 3 standard deviations), then you could use a $t$-distribution with 2 degrees of freedom, increasing that number if the believed proportion of large errors is smaller.
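As a minimal sketch of the robustness point (Python, assuming numpy/scipy; the contaminated sample, the fixed $\nu = 3$, and all numbers are invented for illustration), compare the maximum-likelihood location estimates under normal and under $t$ errors when one gross error is present:

```python
# Fit a location/scale model by maximum likelihood under normal errors
# and under t errors with the degrees of freedom fixed in advance.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=1.0, size=30)
y[0] = 60.0  # one gross "large error"

def neg_loglik_normal(theta):
    mu, log_sigma = theta
    return -stats.norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)).sum()

def neg_loglik_t3(theta):
    mu, log_sigma = theta
    # nu is fixed in advance (here 3), not estimated from the data
    return -stats.t.logpdf(y, df=3, loc=mu, scale=np.exp(log_sigma)).sum()

start = np.array([np.median(y), 0.0])
fit_normal = optimize.minimize(neg_loglik_normal, start)
fit_t3 = optimize.minimize(neg_loglik_t3, start)

print("MLE of mu, normal errors:", fit_normal.x[0])  # the sample mean, dragged toward 60
print("MLE of mu, t(3) errors:  ", fit_t3.x[0])      # stays near 10
```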

I should note that what I have said above is for models with independent $t$-distributed errors. There have also been proposals of a multivariate $t$-distribution (which is not independent) as the error distribution. That proposal is heavily criticized in the paper "The emperor's new clothes: a critique of the multivariate $t$ regression model" by T. S. Breusch, J. C. Robertson and A. H. Welsh, Statistica Neerlandica (1997), Vol. 51, No. 3, pp. 269-286, where they show that the multivariate $t$ error distribution is empirically indistinguishable from the normal. But that criticism does not affect the independent $t$ model.

(*) One reference stating this is Venables & Ripley's MASS---Modern Applied Statistics with S (on page 110 in 4th edition).

kjetil b halvorsen
  • Excellent answer (+1). Note that even when $\nu$ is fixed, the estimating equations are ill defined if [$\nu\leq2$](http://www3.stat.sinica.edu.tw/statistica/oldpdf/a5n12.pdf), so I take it that Gelman means a $t$ distribution with the $\nu$ parameter fixed at $\nu>2$. As illustrated in the answer to this related [question](http://stats.stackexchange.com/questions/82128/what-would-a-robust-bayesian-model-for-estimating-the-scale-of-a-roughly-normal), this places rather strong limits on the robustness that can be expected of this approach. – user603 Oct 20 '14 at 18:27
  • Great answer and comment. But: 1. Gelman is defending a standard procedure that will be better than assuming normal errors, so we should compare the simple case (normal errors) with the $t$ distribution for the errors. 2. In the related question linked by user603, we should note that if we have prior information, we should use it; Bayes excels with prior information, and in the example we do have prior information that is not used. 3. With posterior predictive checks we'd know that the proposed model isn't good enough. – Manoel Galdino Oct 21 '14 at 00:07
  • This answer does not address why it should be a t-distribution and not any of the other heavy-tailed bell-shaped distributions, such as the Cauchy distribution. Also, you don't simply "add degrees of freedom" if you are "afraid" of large errors. The degrees of freedom directly correspond to the amount of data used to train your model. – Neil G Apr 29 '15 at 10:54
  • @Neil G: Yes, but the Cauchy **is** $t_1$! Addressing exactly which heavy-tailed distribution to use of course needs much more analysis. – kjetil b halvorsen Apr 29 '15 at 10:58
  • No, the t-distribution is the *only* choice because the t-distribution is the posterior predictive of the Gaussian model. Gelman wasn't just picking the t-distribution at random. – Neil G Apr 29 '15 at 10:59
  • I don't understand your argument, @Neil G – kjetil b halvorsen Apr 29 '15 at 11:04
  • See: Murphy, Kevin P. "Conjugate Bayesian analysis of the Gaussian distribution" (2007), p. 16. He derives the t-distribution as the posterior predictive of the Gaussian model. It is not merely a case of the modeler choosing an arbitrary heavy-tailed distribution. – Neil G Apr 29 '15 at 11:07
  • @Neil G: you are confused. That paper starts with a Gaussian likelihood and does a Bayesian analysis, and yes, the **predictive** distributions are then $t$. That is no argument for using the $t$-distribution to construct a likelihood: a likelihood starts out with the data distribution, that is, the distribution of the data-generating process, not some predictive distribution. Your argument does not make any sense. – kjetil b halvorsen Apr 29 '15 at 12:35
  • You are the one who is confused. The data-generating process is assumed to be Gaussian. The posterior predictive is therefore t-distributed. When you're using your model whose parameters were inferred from the data to make inferences about future observations, the prediction errors of the future observations are distributed based on the posterior predictive distribution. This is how all Bayesian models work. – Neil G Apr 29 '15 at 12:48
  • Sorry, but I know about Bayesian models. I did not use or refer to Bayesian inference in my answer, nor was I asked about it. I guess this is becoming unproductive; you need to get your basics straight! – kjetil b halvorsen Apr 29 '15 at 13:00
  • I have clarified my answer since I think your misunderstanding is wondering why posterior predictives would be relevant with likelihoods. A parameter whose uncertainty is assumed to be Gaussian is like a mini-Gaussian model within a model. After that parameter is "inferred" (which means, in the language of machine learning, that the data induce a likelihood on it), the actual uncertainty about the parameter is t-distributed due to finite evidence. – Neil G Apr 29 '15 at 13:12
  • Implicit in the choice of the t-distribution seems to be that it gives flexibility by incrementally increasing or decreasing the degrees of freedom (while still fixing them in advance, as per the MASS reference). As someone new to Bayesian methods I would default to using $df=1$, that is, a Cauchy, which has undefined variance. You mention $df=2$ in your answer, which has infinite variance. Would choosing $df=1$ be a problem? Perhaps infinite variance is better than undefined? Or maybe I am reading too much into the $df=2$ in your answer. – Single Malt Nov 09 '21 at 08:01

It is not just a matter of "heavier tails" — there are plenty of distributions that are bell shaped and have heavy tails.

The t-distribution is the posterior predictive of the Gaussian model. If you make a Gaussian assumption but have finite evidence, then the resulting model is necessarily making non-central, scaled t-distributed predictions. In the limit, as the amount of evidence you have goes to infinity, you end up with Gaussian predictions, since the limit of the t-distribution is Gaussian.

Why does this happen? Because with a finite amount of evidence, there is uncertainty in the parameters of your model. In the case of the Gaussian model, uncertainty in the mean would merely increase the variance (i.e., the posterior predictive of a Gaussian with known variance is still Gaussian). But uncertainty about the variance is what causes the heavy tails. If the model is trained with unlimited evidence, there is no longer any uncertainty in the variance (or the mean) and you can use your model to make Gaussian predictions.
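To see this concretely (a minimal sketch in Python, assuming numpy/scipy; the data and prior hyperparameters are invented for the example, and the conjugate updates follow the Murphy note cited in the comments), the posterior predictive under a conjugate Normal-Inverse-Gamma prior is a Student t, which a Monte Carlo simulation of "draw parameters, then draw a new observation" reproduces:

```python
# Conjugate Normal-Inverse-Gamma analysis of Gaussian data: the analytic
# posterior predictive is Student-t; check it against Monte Carlo draws.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=5.0, scale=2.0, size=20)       # finite evidence

mu0, kappa0, alpha0, beta0 = 0.0, 1.0, 1.0, 1.0   # prior hyperparameters
n, ybar = len(y), y.mean()
kappa_n = kappa0 + n
mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
alpha_n = alpha0 + n / 2
beta_n = (beta0 + 0.5 * ((y - ybar) ** 2).sum()
          + kappa0 * n * (ybar - mu0) ** 2 / (2 * kappa_n))

# Analytic posterior predictive: Student-t with 2*alpha_n degrees of freedom.
pred = stats.t(df=2 * alpha_n, loc=mu_n,
               scale=np.sqrt(beta_n * (kappa_n + 1) / (alpha_n * kappa_n)))

# Monte Carlo: draw (sigma^2, mu) from the posterior, then a new observation.
sigma2 = stats.invgamma(alpha_n, scale=beta_n).rvs(size=200_000, random_state=rng)
mu = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))
y_new = rng.normal(mu, np.sqrt(sigma2))

qs = [0.025, 0.5, 0.975]
print("t predictive quantiles:", pred.ppf(qs))
print("Monte Carlo quantiles: ", np.quantile(y_new, qs))
```

As the sample size grows, $2\alpha_n$ grows with it and the predictive t approaches a Gaussian, matching the limit described above.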

This argument applies to a Gaussian model. It also applies to any inferred parameter whose likelihood is Gaussian. Given finite data, the uncertainty about the parameter is t-distributed. Wherever there are normal assumptions (with unknown mean and variance) and finite data, there are t-distributed posterior predictives.

There are similar posterior predictive distributions for all of the Bayesian models. Gelman is suggesting that we should be using those. His concerns would be mitigated by sufficient evidence.

Neil G
  • Can you back this up with some references? – kjetil b halvorsen Apr 29 '15 at 11:05
  • @kjetilbhalvorsen: Murphy, Kevin P. "Conjugate Bayesian analysis of the Gaussian distribution" (2007), p. 16. – Neil G Apr 29 '15 at 11:08
  • Interesting perspective, I'd never heard this before. So do t-distributed errors also lead to t-distributed predictions? To me this is an argument in _favor_ of continuing to use Gaussian errors. Unless you expect _conditional_ outliers, the conditional error model doesn't need to allow for them. This amounts to the assumption that all the outlying-ness comes from outlying values of the predictors. I don't think that assumption is so bad in a lot of cases. And on purely aesthetic grounds, I don't see why the conditional and marginal distributions have to match. – shadowtalker Apr 29 '15 at 13:29
  • @ssdecontrol "Do t-distributed errors also lead to t-distributed predictions?" I don't know, but I don't think so. For me, this perspective is very useful for an intuitive understanding of why the t-test works. – Neil G Apr 29 '15 at 13:31