87

Edits: I have added a simple example: inference of the mean of the $X_i$. I have also slightly clarified why it is a problem that the credible intervals do not match confidence intervals.

I, a fairly devout Bayesian, am in the middle of a crisis of faith of sorts.

My problem is the following. Assume that I want to analyse some IID data $X_i$. What I would do is:

  • first, propose a conditional model: $$ p(X|\theta) $$

  • Then, choose a prior on $\theta$: $$ p(\theta) $$

  • Finally, apply Bayes' rule to compute the posterior $p(\theta | X_1 \dots X_n )$ (or an approximation to it if it is intractable) and answer all the questions I have about $\theta$

This is a sensible approach: if the true model of the data $X_i$ is indeed "inside" my conditional model (i.e., it corresponds to some value $\theta_0$), then I can call upon statistical decision theory to say that my method is admissible (see Robert's "The Bayesian Choice" for details; "All of Statistics" also gives a clear account in the relevant chapter).

However, as everybody knows, assuming that my model is correct is fairly arrogant: why should nature fall neatly inside the box of the models which I have considered? It is much more realistic to assume that the real model of the data $p_{true}(X)$ differs from $p(X|\theta)$ for all values of $\theta$. This is usually called a "misspecified" model.

My problem is that, in this more realistic misspecified case, I don't have any good arguments for being Bayesian (i.e., computing the posterior distribution) versus simply computing the Maximum Likelihood Estimator (MLE):

$$ \hat \theta_{ML} = \arg \max_\theta [ p(X_1 \dots X_n |\theta) ] $$

Indeed, according to Kleijn and van der Vaart (2012), in the misspecified case, the posterior distribution:

  • converges as $n\rightarrow \infty$ to a Dirac distribution centered at $\hat \theta_{ML}$;

  • does not have the correct variance (unless the two variances happen to coincide) to ensure that credible intervals of the posterior match confidence intervals for $\theta$. (Note that, while confidence intervals are obviously not something Bayesians care about excessively, this qualitatively means that the posterior distribution is intrinsically wrong, since it implies that its credible intervals do not have correct coverage.)

Thus, we are paying a computational premium (Bayesian inference, in general, is more expensive than MLE) for no additional properties.

Thus, finally, my question: are there any arguments, whether theoretical or empirical, for using Bayesian inference over the simpler MLE alternative when the model is misspecified?

(Since I know that my questions are often unclear, please let me know if you don't understand something: I'll try to rephrase it.)

Edit: let's consider a simple example: inferring the mean of the $X_i$ under a Gaussian model (with known variance $\sigma^2$ to simplify even further). We consider a Gaussian prior: we denote by $\mu_0$ the prior mean and by $\beta_0$ the inverse variance of the prior. Let $\bar X$ be the empirical mean of the $X_i$. Finally, define $\mu = (\beta_0 \mu_0 + \frac{n}{\sigma^2} \bar X) / (\beta_0 + \frac{n}{\sigma^2} )$.

The posterior distribution is:

$$ p(\theta |X_1 \dots X_n)\; \propto\; \exp\!\Big( - (\beta_0 + \frac{n}{\sigma^2} ) (\theta - \mu)^2 / 2\Big) $$

In the correctly specified case (when the $X_i$ really have a Gaussian distribution), this posterior has the following nice properties:

  • If the $X_i$ are generated from a hierarchical model in which their shared mean is picked from the prior distribution, then the posterior credible intervals have exact coverage. Conditional on the data, the probability of $\theta$ being in any interval is equal to the probability that the posterior ascribes to this interval.

  • Even if the prior isn't correct, the credible intervals have correct coverage in the limit $n\rightarrow \infty$, in which the prior's influence on the posterior vanishes.

  • The posterior further has good frequentist properties: any Bayesian estimator constructed from the posterior is guaranteed to be admissible, the posterior mean is an efficient estimator (in the Cramér-Rao sense) of the mean, and credible intervals are, asymptotically, confidence intervals.

In the misspecified case, most of these properties are not guaranteed by the theory. To fix ideas, let's assume that the real model of the $X_i$ is instead a Student-t distribution. The only property that we can still guarantee (Kleijn and van der Vaart) is that the posterior distribution concentrates on the real mean of the $X_i$ in the limit $n \rightarrow \infty$. In general, the coverage properties vanish. Worse, in general, we can guarantee that in that limit the coverage properties are wrong: the posterior distribution ascribes the wrong probability to various regions of parameter space.
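
To make the coverage failure concrete, here is a minimal simulation sketch (my own illustration, not taken from Kleijn and van der Vaart): the data are Student-t but are analysed with the Gaussian-known-variance model and the conjugate posterior above. The sample size, degrees of freedom and prior settings are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed model: Gaussian likelihood with "known" variance sigma2 and a
# conjugate N(mu0, 1/beta0) prior on the mean, as in the formulas above.
sigma2, mu0, beta0 = 1.0, 0.0, 1.0
n, n_rep, df, true_mean = 50, 5000, 3, 0.0   # illustrative settings

covered = 0
for _ in range(n_rep):
    # True data-generating process: Student-t with 3 degrees of freedom, whose
    # variance is df/(df-2) = 3, not the assumed sigma2 = 1 (misspecification).
    x = true_mean + rng.standard_t(df, size=n)

    post_prec = beta0 + n / sigma2
    post_mean = (beta0 * mu0 + (n / sigma2) * x.mean()) / post_prec
    lo, hi = stats.norm.interval(0.95, loc=post_mean, scale=np.sqrt(1 / post_prec))
    covered += (lo <= true_mean <= hi)

print("empirical coverage of the 95% credible interval:", covered / n_rep)
# Coverage comes out well below 0.95: the credible interval is far too narrow,
# because the assumed sigma2 understates the true variance of the data.
```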

gung - Reinstate Monica
Guillaume Dehaene
  • 4
    Well, Bayesian approaches regularize. That is something, to help against overfitting - whether or not your model is misspecified. Of course, that just leads to the related question about arguments for Bayesian inference against *regularized* classical approaches (lasso, ridge regression, elastic net etc.). – Stephan Kolassa Apr 20 '17 at 15:43
  • 3
    You might be interested in [this work](https://arxiv.org/abs/1412.3730) and its relatives. – Danica Apr 20 '17 at 15:45
  • 8
    When your model is misspecified in terms of using incorrect likelihood function, then both MLE and Bayesian estimates would be wrong... – Tim Apr 20 '17 at 15:45
  • 1
    @Tim, you probably missed this: *we are paying a computational premium (Bayesian inference, in general, is more expensive than MLE) for no additional properties*. – Richard Hardy Apr 20 '17 at 15:50
  • 6
    @Tim: the MLE and Bayesian inference are not meaningless in the misspecified case: they both try to recover the parameter value $\tilde \theta_0$ which gives the best account of the data within the conditional models. More precisely, $\tilde \theta_0$ is the argmin of $ KL[p(X), p(X|\theta)] $ where KL is the Kullback Leibler divergence. Under mild assumptions, both MLE and Bayesian inference correctly identify this $\tilde \theta_0$ when provided with a sufficient amount of data – Guillaume Dehaene Apr 20 '17 at 17:55
  • I am confused by your second bullet point. You say that in the misspecified case, the credible interval of the posterior will not match the confidence interval; but why should any Bayesian consider such "matching" as something desirable? Confidence intervals are a frequentist tool; hard-core Bayesians usually claim that CIs are misguided and should not be trusted. – amoeba Apr 20 '17 at 20:42
  • 3
    @amoeba I imagine hard-core Bayesian look and act like comandante Che – Aksakal Apr 20 '17 at 20:57
  • @GuillaumeDehaene: There is one place where Bayesian models completely break down and that is when you place zero prior probability somewhere where there should be nonzero probability (or vice-versa). For example, it doesn't matter how many data points (e.g votes) you gather in support of the idea that the speed of light is infinite; it simply isn't, even if 6 billion people say otherwise. It sounds stupid in hindsight when you already know the fact but it should make you pause when you're doing something similar about something you don't actually know. – user541686 Apr 20 '17 at 23:54
  • @Mehrdad Your example begs the question of where folks with the privileged perspective got a prior belief that was somehow unavailable to 6 billion. It feels like you are inserting a non-Bayesian interpretation of probability into the example. – Alexis Apr 21 '17 at 01:53
  • 1
    Regarding your edit: but credible intervals are not supposed to have correct coverage! If you want coverage guarantees, you should be doing frequentist statistics, this is what it's all about. Bayesian statistics does *not* generally promise you correct coverage, this is admitted and accepted (and even praised!) by the Bayesians. This seems to be a big confusion here. – amoeba Apr 21 '17 at 07:42
  • Credible intervals are defined so as to have a coverage property: if we sample from a hierarchical model, conditional on the data we observed, then the probability of $\theta$ falling inside a credible interval with level $\alpha$ should be equal to $\alpha$. Maybe the word "coverage" is incorrect here, I guess? Anyway, this property fails in the misspecified case – Guillaume Dehaene Apr 21 '17 at 07:52
  • This is not correct. You don't need the concept of coverage to define a credible interval. If you're really concerned about model misspecification, you should use a Bayesian Nonparametric approach and understand the subtleties of how to control the support of your prior. If you do so, and care about it, your Bayesian procedures will have good "what if"/frequentist properties. A good starting point is this paper: https://www.google.com.br/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=barron+wasserman+schervish – Zen Apr 21 '17 at 12:04
  • Maybe the word "coverage" is not the appropriate one (though I can't find an alternative). The point is that credible intervals have a single nice property in the correctly specified case which vanishes in the misspecified case. Thank you for the reference. It seems like they only prove the same result as Kleijn, Vaart: in the large-data limit, the posterior distribution converges to a (close equivalent to) a diract at the correct value. They fail to prove that the rate of contraction of the posterior around the truth has the correct properties. I'll check it out in detail. – Guillaume Dehaene Apr 21 '17 at 12:16
  • They don't "fail" to prove. After the work of Lorraine Schwarz and others (Freedman, Diaconis etc), this BSW work is one of the pioneering papers in this field of research. Contraction rates have been studied extensively since then. Anyway, this is not something to be understood in 12 minutes (the time between my comment and yours). – Zen Apr 21 '17 at 14:02
  • Reorganized and expanded my answer to make better sense. I think it may be helpful to think of a toy domain which includes some of the complexity of the real-world problem (I feel that inferring the mean of a distribution is somewhat too simple). My proposal is that optimization is a good sandbox. I am interested in your thoughts about it. – lacerbi Apr 21 '17 at 16:04
  • Added extra reference to my answer. – lacerbi May 16 '17 at 16:25
  • MLE has assumptions as well. It is often used inappropriately. One Bayesian model may be vastly inferior in predictive ability to another, that is, if the model is less well matched to the data than an alternative model. If one model is not working properly, the first step is to search for a better model of the data, and only then apply the new model in a tailored, goal-oriented Bayesian context. – Carl Jan 16 '19 at 23:53

11 Answers

34

I consider Bayesian approach when my data set is not everything that is known about the subject, and want to somehow incorporate that exogenous knowledge into my forecast.

For instance, my client wants a forecast of the loan defaults in their portfolio. They have 100 loans with a few years of quarterly historical data. There were a few occurrences of delinquency (late payment) and just a couple of defaults. If I try to estimate a survival model on this data set, there's very little data to estimate it from and too much uncertainty in the forecast.

On the other hand, the portfolio managers are experienced people, some of them may have spent decades managing relationships with borrowers. They have ideas around what the default rates should be like. So, they're capable of coming up with reasonable priors. Note, not the priors which have nice math properties and look intellectually appealing to me. I'll chat with them and extract their experiences and knowledge in the form of those priors.

Now the Bayesian framework provides me with the mechanics to marry the exogenous knowledge, in the form of priors, with the data, and obtain a posterior that is superior to both pure qualitative judgment and a purely data-driven forecast, in my opinion. This is not a philosophy and I'm not a Bayesian. I'm just using the Bayesian tools to consistently incorporate expert knowledge into the data-driven estimation.
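
As a minimal sketch of the mechanics (all numbers are made up for illustration, this is not the client's actual model): encode the managers' view of the default rate as a Beta prior and update it with the handful of observed defaults.

```python
from scipy import stats

# Hypothetical expert prior on the default probability: the portfolio managers
# believe defaults run at roughly 2%, which a Beta(2, 98) encodes fairly loosely.
a_prior, b_prior = 2.0, 98.0

# Observed data: 100 loans, 3 defaults -- far too little to stand on its own.
n_loans, n_defaults = 100, 3

# Conjugate Beta-Binomial update: Beta(a + defaults, b + non-defaults).
post = stats.beta(a_prior + n_defaults, b_prior + (n_loans - n_defaults))

print("posterior mean default rate:", post.mean())     # ~0.025
print("95% credible interval:", post.interval(0.95))
# The estimate sits between the experts' 2% and the raw 3/100 = 3%, with the
# prior keeping the tiny sample from dominating the forecast.
```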

Aksakal
  • 3
    A very nice point. Bayesian inference does offer a framework for solving precisely a task like the one you have presented. Thank you. – Guillaume Dehaene Apr 20 '17 at 17:57
  • 9
    This is a general argument for Bayesian modelling, but how does it relate to the specific case of misspecified model? I do not see a connection. – Richard Hardy Apr 21 '17 at 05:31
  • 5
    Well, it does relate to my question: even in the misspecified case, bayesian inference does handle better (i.e: in a more principled fashion) qualitative information, via the prior, than MLE methods, which would have to work with regularizers. It's a form of empirical argument as to why bayesian inference is slightly better than MLE. – Guillaume Dehaene Apr 21 '17 at 07:01
  • @RichardHardy In econometrics and finance I assume that all models are misspecified, pretty much. I'm sure that in social sciences and business the situation is even worse. This is not physics or chemistry. – Aksakal Apr 21 '17 at 11:19
  • (+1) for _"Note, not the priors which have nice math properties and look intellectually appealing to me. I'll chat with them and extract their experiences and knowledge in the form of those priors."_ – Alecos Papadopoulos Apr 21 '17 at 11:24
  • 2
    @Aksakal, whether models are misspecified is beside the point. What I am concerned with is that you do not answer the question. (If the OP disagrees, then I think he has done a poor job in formulating the question.) But I see there has been a recent edit, so perhaps the question has been changed by now. – Richard Hardy Apr 21 '17 at 11:45
  • 6
    @RichardHardy, I think my answer goes to the heart of the OP's crisis of faith, which is driven by the thought that if your conditional model is misspecified then it'll overpower the prior with increasing sample size and your posterior will be pushed towards the wrong model. In this case, why bother with a Bayesian approach to start with, why not just go straight to MLE, he asks. My example is decidedly not philosophical, but practical: you often deal not just with finite, but small samples. So your data will not drag the posterior too far away from the prior, which represents the exogenous knowledge. – Aksakal Apr 21 '17 at 16:02
  • Now the last sentence is *the* answer. It took some time to extract, but it was worth the effort. – Richard Hardy Apr 21 '17 at 16:21
29

A very interesting question...that may not have an answer (but that does not make it less interesting!)

A few thoughts (and many links to my blog entries!) about that meme that all models are wrong:

  1. While the hypothetical model is indeed almost invariably and irremediably wrong, it still makes sense to act in an efficient or coherent manner with respect to this model if this is the best one can do. The resulting inference produces an evaluation of the formal model that is the "closest" to the actual data generating model (if any);
  2. There exist Bayesian approaches that can do without the model, a most recent example being the papers by Bissiri et al. (with my comments) and by Watson and Holmes (which I discussed with Judith Rousseau);
  3. In a connected way, there exists a whole branch of Bayesian statistics dealing with M-open inference;
  4. And yet another direction I like a lot is the SafeBayes approach of Peter Grünwald, who takes into account model misspecification to replace the likelihood with a down-graded version expressed as a power of the original likelihood.
  5. The very recent Read Paper by Gelman and Hennig addresses this issue, albeit in a circumvoluted manner (and I added some comments on my blog). I presume you could gather material for a discussion from the entries about your question.
  6. In a sense, Bayesians should be the least concerned among statisticians and modellers about this aspect since the sampling model is to be taken as one of several prior assumptions and the outcome is conditional or relative to all those prior assumptions.
Glorfindel
Xi'an
  • 2
    It's very nice to have your opinion on this. Your first point makes intuitive sense: if the model isn't too wrong, then the result of our inference should be ok. However, has anybody ever proved any result like that (or explored the question empirically)? Your last point (which I might have misunderstood) leaves me perplexed: the sampling model is a critical choice. The fact that we also make other choices doesn't mean that errors in the choice of the sampling model can't taint the whole model. Thank you for the references and the wonderful blog. – Guillaume Dehaene Apr 21 '17 at 12:07
  • For point 1., why not Bayesian model averaging? Why just use the 'best' model? – innisfree Apr 21 '17 at 14:53
  • @innisfree: it all depends on what you plan to do with the outcome, I have no religion about model averaging versus best model. – Xi'an Apr 21 '17 at 15:50
  • 1
    You seem to be suggesting that there is a decision-theoretic aspect of averaging model uncertainty versus picking only the 'best' model. Surely it's always advantageous, i.e. helps make better decisions, to coherently incorporate all uncertainties, including model uncertainties. – innisfree Apr 21 '17 at 16:00
  • @innisfree: yes I do mean it depends on the loss function one chooses to evaluate one's actions. For a zero-one loss function, choosing the most likely model is the optimal Bayes decision. – Xi'an Apr 21 '17 at 16:02
  • 1
    @GuillaumeDehaene: Thanks. Wrt the last point, I mean that all Bayesian statements are conditional on the chosen Universe. They do not (dare to) say anything outside that Universe. From a Bayesian perspective, if one entertains the notion of a possibly wrong model, one should include a model for being in the wrong model. One partly satisfactory solution is to go non-parametric. – Xi'an Apr 21 '17 at 16:04
  • 2
    My main objection to non-parametrics is practical: they are more computationally expensive by several orders of magnitude compared to simpler alternatives. Furthermore, don't we also run into trouble with non-parametrics, because it's almost impossible for two prior distributions to have common support? That means that the prior would have a heavy influence and that it would be (almost) impossible for Bayesian statisticians to agree when starting from different priors. – Guillaume Dehaene Apr 24 '17 at 10:59
  • @Xian returning to this as I took your book from library. For a 0-1 loss function, mode is indeed optimal. But wouldn't it be mode of $p (x) = \sum p (x|m)p(m) $, which is model averaged? – innisfree Nov 20 '17 at 11:05
17

I only see this today but still I think I should chip in given that I'm kind of an expert and that at least two answers (nr 3 and 20 (thanks for referring to my work Xi'an!)) mention my work on SafeBayes - in particular G. and van Ommen, "Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It" (2014). And I'd also like to add something to comment 2:

2 says: (an advantage of Bayes under misspecification is ...) "Well, Bayesian approaches regularize. That is something, to help against overfitting - whether or not your model is misspecified. Of course, that just leads to the related question about arguments for Bayesian inference against regularized classical approaches (lasso etc)"

This is true, but it is crucial to add that Bayesian approaches may not regularize enough if the model is wrong. This is the main point of the work with Van Ommen - we see there that standard Bayes overfits rather terribly in some regression contexts with wrong-but-very-useful models. Not as bad as MLE, but still way too much to be useful. There's a whole strand of work in (frequentist and game-theoretic) theoretical machine learning where they use methods similar to Bayes, but with a much smaller 'learning rate' - making the prior more and the data less important, thus regularizing more. These methods are designed to work well in worst-case situations (misspecification and even worse, adversarial data) - the SafeBayes approach is designed to 'learn the optimal learning rate' from the data itself - and this optimal learning rate, i.e. the optimal amount of regularization, in effect depends on geometrical aspects of the model and the underlying distribution (i.e. whether the model is convex or not).
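
To illustrate the 'learning rate' idea on the toy Gaussian-mean model from the question (my own sketch, showing only the effect of a fixed learning rate $\eta$; it does not implement the SafeBayes rule for choosing $\eta$ from the data): the $\eta$-generalized posterior multiplies the prior by the likelihood raised to the power $\eta$, which in the conjugate case simply rescales the data's contribution to the posterior precision.

```python
import numpy as np

def generalized_posterior(x, sigma2=1.0, mu0=0.0, beta0=1.0, eta=1.0):
    """Posterior mean and sd for the mean under a N(theta, sigma2) likelihood
    (sigma2 'known') and a N(mu0, 1/beta0) prior, with the likelihood tempered
    by eta. eta = 1 is standard Bayes; eta < 1 down-weights the data, i.e.
    regularizes more, which is the direction SafeBayes moves in when the
    model is misspecified."""
    n = len(x)
    prec = beta0 + eta * n / sigma2
    mean = (beta0 * mu0 + eta * (n / sigma2) * x.mean()) / prec
    return mean, np.sqrt(1.0 / prec)

rng = np.random.default_rng(1)
x = rng.standard_t(3, size=200)          # misspecified: the data are Student-t
for eta in (1.0, 0.5, 0.25):
    m, s = generalized_posterior(x, eta=eta)
    print(f"eta = {eta}: posterior mean {m:+.3f}, posterior sd {s:.3f}")
# Smaller eta widens the posterior, partially compensating for the
# over-confidence that standard Bayes (eta = 1) shows under misspecification.
```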

Relatedly, there is a folk theorem (mentioned by several above) saying that Bayes will have the posterior concentrate on the distribution closest in KL divergence to the 'truth'. But this only holds under very stringent conditions - MUCH more stringent than the conditions needed for convergence in the well-specified case. If you're dealing with standard low dimensional parametric models and data are i.i.d. according to some distribution (not in the model) then the posterior will indeed concentrate around the point in the model that is closest to the truth in KL divergence. Now if you're dealing with large nonparametric models and the model is correct, then (essentially) your posterior will still concentrate around the true distribution given enough data, as long as your prior puts sufficient mass in small KL balls around the true distribution. This is the weak condition that is needed for convergence in the nonparametric case if the model is correct.

But if your model is nonparametric yet incorrect, then the posterior may simply not concentrate around the closest KL point, even if your prior puts mass close to 1 (!) there - your posterior may remain confused for ever, concentrating on ever-different distributions as time proceeds but never around the best one. In my papers I have several examples of this happening. The papers that do show convergence under misspecification (e.g. Kleijn and van der Vaart) require a lot of additional conditions, e.g. the model must be convex, or the prior must obey certain (complicated) properties. This is what I mean by 'stringent' conditions.

In practice we're often dealing with parametric yet very high dimensional models (think Bayesian ridge regression etc.). Then if the model is wrong, eventually your posterior will concentrate on the best KL-distribution in the model but a mini-version of the nonparametric inconsistency still holds: it may take orders of magnitude more data before convergence happens - again, my paper with Van Ommen gives examples.

The SafeBayes approach modifies standard bayes in a way that guarantees convergence in nonparametric models under (essentially) the same conditions as in the well-specified case, i.e. sufficient prior mass near the KL-optimal distribution in the model (G. and Mehta, 2014).

Then there's the question of whether Bayes even has justification under misspecification. IMHO (and as also mentioned by several people above), the standard justifications of Bayes (admissibility, Savage, De Finetti, Cox etc) do not hold here (because if you realize your model is misspecified, your probabilities do not represent your true beliefs!). HOWEVER many Bayes methods can also be interpreted as 'minimum description length (MDL) methods' - MDL is an information-theoretic method which equates 'learning from data' with 'trying to compress the data as much as possible'. This data compression interpretation of (some) Bayesian methods remains valid under misspecification. So there is still some underlying interpretation that holds up under misspecification - nevertheless, there are problems, as my paper with van Ommen (and the confidence interval/credible set problem mentioned in the original post) show.

And then a final remark about the original post: you mention the 'admissibility' justification of Bayes (going back to Wald's complete class thm of the 1940s/50s). Whether or not this is truly a justification of Bayes really depends very much on one's precise definition of 'Bayesian inference' (which differs from researcher to researcher...). The reason is that these admissibility results allow the possibility that one uses a prior that depends on aspects of the problem such as sample size, and loss function of interest etc. Most 'real' Bayesians would not want to change their prior if the amount of data they have to process changes, or if the loss function of interest is suddenly changed. For example, with strictly convex loss functions, minimax estimators are also admissible - though not usually thought of as Bayesian! The reason is that for each fixed sample size, they are equivalent to Bayes with a particular prior, but the prior is different for each sample size.

Hope this is useful!

  • 3
    Welcome to CrossValidated and thanks for responding on this question. A minor note -- you can't rely on the answers being sorted in the same order as you see them; different people can sort in different orders (there's a choice of different sorting criteria at the top of the highest placed answer) and two of those criteria change over time. That is if you refer to them as "nr 3 and 20" people won't know which answers you mean. [I can only find ten answers as well.] – Glen_b May 15 '17 at 22:10
  • 1
    Thank you for a great answer Peter. I'm confused about your comment that Bayesian inference in the misspecified case requires very strong assumptions. Which assumptions are you explicitly referring to? Are you talking about the condition that the posterior needs to converge to a Dirac distribution on the best parameter value? Or are you talking about the more technical conditions on the likelihood which ensure asymptotic normality? – Guillaume Dehaene May 16 '17 at 06:25
  • Ok, thanks to Glen B (moderator) - I'll keep this in mind from now on. – Peter Grünwald May 22 '17 at 21:10
  • Guillaume - I'm updating the above to take your comment into account – Peter Grünwald May 22 '17 at 21:12
14

Edits: Added reference to this paper in the body, as requested by the OP.


I am giving an answer as a naive empirical Bayesian here.

First, the posterior distribution allows you to do computations that you simply cannot do with a straightforward MLE. The simplest case is that today's posterior is tomorrow's prior. Bayesian inference naturally allows for sequential updates, or more in general online or delayed combination of multiple sources of information (incorporating a prior is just one textbook instance of such combination). Bayesian Decision Theory with a nontrivial loss function is another example. I would not know what to do otherwise.
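
As a minimal sketch of the "today's posterior is tomorrow's prior" point (my own illustration, using the conjugate Gaussian-mean model from the question): updating on two batches in sequence, with the first posterior used as the second prior, gives the same result as a single update on all the data.

```python
import numpy as np

def update(mu0, beta0, x, sigma2=1.0):
    """One conjugate Normal update: posterior mean and precision for the mean
    after observing batch x under a N(theta, sigma2) likelihood."""
    n = len(x)
    beta1 = beta0 + n / sigma2
    mu1 = (beta0 * mu0 + (n / sigma2) * x.mean()) / beta1
    return mu1, beta1

rng = np.random.default_rng(2)
x = rng.normal(0.3, 1.0, size=1000)

# Sequential: today's posterior becomes tomorrow's prior...
mu, beta = update(0.0, 1.0, x[:400])
mu, beta = update(mu, beta, x[400:])

# ...and matches a one-shot update on the full data set.
mu_all, beta_all = update(0.0, 1.0, x)
print(mu, beta)
print(mu_all, beta_all)   # identical up to floating-point error
```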

Second, with this answer I will try to argue that the mantra "quantification of uncertainty is generally better than no uncertainty" is effectively an empirical claim, since theorems (as you mentioned, and as far as I know) provide no guarantees.

Optimization as a toy model of scientific endeavor

A domain that I feel fully captures the complexity of the problem is a very practical, no-nonsense one, the optimization of a black-box function $f: \mathcal{X} \subset \mathbb{R}^D \rightarrow \mathbb{R}$. We assume that we can sequentially query a point $x \in \mathcal{X}$ and get a possibly noisy observation $y = f(x) + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0,\sigma^2)$. Our goal is to get as close as possible to $x^* = \arg\min_x f(x)$ with the minimum number of function evaluations.

A particularly effective way to proceed, as you may expect, is to build a predictive model of what would happen if I query any $x^\prime \in \mathcal{X}$, and use this information to decide what to do next (either locally or globally). See Rios and Sahinidis (2013) for a review of derivative-free global optimization methods. When the model is complex enough, this is called a meta-model, surrogate-function or response-surface approach. Crucially, the model could be a point estimate of $f$ (e.g., the fit of a radial basis function network to our observations), or we could be Bayesian and somehow get a full posterior distribution over $f$ (e.g., via a Gaussian process).

Bayesian optimization uses the posterior over $f$ (in particular, the joint conditional posterior mean and variance at any point) to guide the search of the (global) optimum via some principled heuristic. The classical choice is to maximize the expected improvement over the current best point, but there are even fancier methods, like minimizing the expected entropy over the location of the minimum (see also here).
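
A minimal sketch of that loop (my own toy illustration, not the method of any particular paper), using a scikit-learn Gaussian process surrogate and the expected-improvement heuristic over a 1-D candidate grid; the objective, kernel, noise level and budget are arbitrary choices:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def f(x):
    # Black-box objective with observation noise (unknown to the optimizer).
    return np.sin(3 * x) + 0.1 * x**2 + rng.normal(0, 0.1, size=np.shape(x))

def expected_improvement(mu, sd, best):
    # EI for minimization: expected amount by which a candidate beats `best`.
    sd = np.maximum(sd, 1e-12)
    z = (best - mu) / sd
    return (best - mu) * norm.cdf(z) + sd * norm.pdf(z)

# A few initial design points on [-2, 2], then: fit GP, maximize EI, evaluate f.
X = rng.uniform(-2, 2, size=5).reshape(-1, 1)
y = f(X).ravel()
candidates = np.linspace(-2, 2, 400).reshape(-1, 1)

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=0.1**2,
                                  normalize_y=True).fit(X, y)
    mu, sd = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(expected_improvement(mu, sd, y.min()))]
    X = np.vstack([X, x_next.reshape(1, -1)])
    y = np.append(y, f(x_next.item()))

print("best point found:", X[np.argmin(y)].item(), "with value", y.min())
```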

The empirical result here is that having access to a posterior, even if partially misspecified, generally produces better results than other methods. (There are caveats and situations in which Bayesian optimization is no better than random search, such as in high dimensions.) In this paper, we perform an empirical evaluation of a novel BO method vs. other optimization algorithms, checking whether using BO is convenient in practice, with promising results.

Since you asked -- this has a much higher computational cost than other non-Bayesian methods, and you were wondering why we should be Bayesian. The assumption here is that the cost involved in evaluating the true $f$ (e.g., in a real scenario, a complex engineering or machine learning experiment) is much larger than the computational cost for the Bayesian analysis, so being Bayesian pays off.

What can we learn from this example?

First, why does Bayesian optimization work at all? I guess that the model is wrong, but not that wrong, and as usual wrongness depends on what your model is for. For example, the exact shape of $f$ is not relevant for optimization, since we could be optimizing any monotonic transformation thereof. I guess nature is full of such invariances. So, the search we are doing might not be optimal (i.e., we are throwing away good information), but still better than with no uncertainty information.

Second, our example highlights that it is possible that the usefulness of being Bayesian or not depends on the context, e.g. the relative cost and amount of available (computational) resources. (Of course if you are a hardcore Bayesian you believe that every computation is Bayesian inference under some prior and/or approximation.)

Finally, the big question is -- why are the models we use not-so-bad after all, in the sense that the posteriors are still useful and not statistical garbage? If we take the No Free Lunch theorem, apparently we shouldn't be able to say much, but luckily we do not live in a world of completely random (or adversarially chosen) functions.

More in general, since you put the "philosophical" tag... I guess we are entering the realm of the problem of induction, or the unreasonable effectiveness of mathematics in the statistical sciences (specifically, of our mathematical intuition & ability to specify models that work in practice) -- in the sense that from a purely a priori standpoint there is no reason why our guesses should be good or have any guarantee (and for sure you can build mathematical counterexamples in which things go awry), but they turn out to work well in practice.

lacerbi
  • 3
    Awesome answer. Thank you very much for your contribution. Is there any review / fair comparison of Bayesian optimization vs normal optimization techniques that highlights that the Bayesian version is empirically better as you claim? (I'm quite fine with taking you at your word, but a reference would be useful) – Guillaume Dehaene Apr 24 '17 at 14:11
  • 1
    Thanks! I think that the [probabilistic numerics](https://arxiv.org/pdf/1506.01326.pdf) call-to-arms contains several theoretical and empirical arguments. I am not aware of a benchmark that really compares BO methods with standard methods, but [*trigger warning: shameless plug*] I am currently working on something along these lines within the field of computational neuroscience; I plan to put some of the results on arXiv, hopefully within the next few weeks. – lacerbi Apr 24 '17 at 14:21
  • Indeed, at least their figure 2 has a clear comparison. Could you please add your work to your main question once it is out? I feel like it would be a valuable addition. – Guillaume Dehaene Apr 24 '17 at 14:25
  • Yes -- that's their method for adaptive Bayesian quadrature, which is a pretty cool idea (in practice, its effectiveness depends on whether the GP approximation works; which is often near-equivalent to say that you have a sensible parameterization of your problem). I will add the link to the answer when my work is available, thanks. – lacerbi Apr 24 '17 at 14:30
  • I feel like the example / paper does not capture the (practical) problems of scientific endeavor. Even with a black box approach, you make sampling assumptions that almost never strictly hold for real data. We can not query several data points of a fixed DGP, and in particular this means that based on $x$, our function will not have a (Gaussian) additive error. $x$ will (in practice) not have sufficient information for this to hold. So indeed, our model - in particular this GP one - is likely wrong. This is probably the number 1 problem we face, and one would ask if your ex. sidesteps this. – IMA Oct 21 '19 at 11:25
  • For example, if the assumptions would hold, we would be justified in throwing an arbitrarily complex neural network at the problem and call it a day. But if we take our theory seriously, then there is a trade-off between our modeled process being "as right as possible" ( meaning general), and our estimation being identified. This sounds non-Bayesian, but it holds here as well. – IMA Oct 21 '19 at 11:29
  • And this is why, I think, the majority of papers with observational data move toward a non-Bayesian, (semi)non-parametric approaches trying to mimic experiments. When defending your empirical analysis, assumptions such as Gaussian errors are a non-starter. Only actual experiments would have data where this assumption makes sense. Instead, one is asked to exploit features in the data that do not rely on such "functional" assumptions. And when even these simple features of the true DGP are difficult to estimate, how heroic is it to think we can truly approximate the whole distribution? – IMA Oct 21 '19 at 11:35
  • So that being said, I fully agree when it comes to your comparison of fully specified models. In particular, the transparent process of deriving knowledge from (new) data is an advantage of Bayesian over other models even if all models are wrong. But if all models are wrong, and in particular, wrong enough so that no GP can be assumed, the question is still valid - should we not use other statistical approaches? Is our knowledge of a completely wrong model, provided by Bayesian analysis, at all useful? – IMA Oct 21 '19 at 11:43
  • 1
    @IMA: Sorry, I don't think that I 100% get your point. I was taking black-box optimization as a _toy model_ of the scientific endeavor. I believe you can map many steps and problems of "science" to this simpler (but still incredibly complex) domain. There is no need for the "Gaussian noise" assumption for my argument, it was just for simplicity. Real-world optimization problems (e.g., in engineering) may be corrupted by non-Gaussian noise, and that's something that needs to be dealt with. And Gaussian processes do not _need_ Gaussian observation noise (although it makes inference easy). – lacerbi Oct 21 '19 at 18:12
  • Sorry, then I misunderstand your point, because both in the paper and the example, it seems to me that you assume an additive noise term with a stable distribution (and indeed it seems to me that this is Gaussian). I claim, and I think that is quite accepted, that this is a heroic assumption. Especially in nonexperimental sciences, where the amount of data we have, compared to the factors generating $y$, is relatively small, it is already sort of misleading to think of it as noise. It is unobserved variation, which is the most significant issue. This then leads one to try to estimate "easier" things than posteriors etc. – IMA Oct 21 '19 at 19:04
  • Of course whether one assumes a Gaussian process, or a Gaussian noise term, is often not significant, as for many parameters of interest we can suitable decompose our process into conditionals and noise. My point remains, that the number one issue is to find a parsimonious model, simply because we can expect to be wrong with most assumptions we make, and our data is too scarce to identify complicated, but general models. This is what I would understand the main critique of Bayesian parametric estimation. – IMA Oct 21 '19 at 19:08
  • The assumptions in the post/paper are likely wrong/inexact; the question is whether they are useful. To my surprise, the answer has been "yes" (at least in this case). The referred optimization method, that uses a likely wrong/misspecified Gaussian process model for an arbitrary target function (which possibly does not meet such assumptions) ends up working well _empirically_. So, to me being _somewhat_ Bayesian pays off. However, I agree that we want to have robust methods that take care of cases in which our models are horribly wrong (e.g., in the paper there is a model-free component). – lacerbi Oct 22 '19 at 13:52
  • @lacerbi "The simplest case is that today's posterior is tomorrow's prior." this can also be pointed as a disadvantage. In fact, the new posterior will depend on the amount of observable data and how this is made available. For a normal prior and new data which follows also a normal distribution, you'll find that the variance of the posterior will depend on the size of each block of new data, i.e. 10 blocks of 100 new data points will give you a different posterior than 1 block of 1000 new data points. – jpcgandre Nov 20 '20 at 13:19
  • @lacerbi "quantification of uncertainty is generally better than no uncertainty" I agree although the uncertainty of the posterior bias is probably the most important and I'm not sure how do Bayesian methods address this. – jpcgandre Nov 20 '20 at 13:28
  • @lacerbi "noisy observation y=f(x)+ε, with ε∼N(0,σ2)" the issue is that ε never has a zero mean.... – jpcgandre Nov 20 '20 at 15:09
8

Here are a few other ways of justifying Bayesian inference in misspecified models.

  • You can construct a confidence interval on the posterior mean, using the sandwich formula (in the same way that you would do with the MLE); a minimal numerical sketch appears after this list. Thus, even though the credible sets don't have coverage, you can still produce valid confidence intervals on point estimators, if that's what you're interested in.

  • You can rescale the posterior distribution to ensure that credible sets have coverage, which is the approach taken in:

Müller, Ulrich K. "Risk of Bayesian inference in misspecified models, and the sandwich covariance matrix." Econometrica 81.5 (2013): 1805-1849.

  • There's a non-asymptotic justification for Bayes rule: omitting the technical conditions, if the prior is $p(\theta)$, and the log-likelihood is $\ell_n(\theta)$, then the posterior is the distribution that minimizes $-\int \ell_n(\theta) d\nu(\theta) + \int \log\!\Big(\frac{\nu(\theta)}{p(\theta)}\Big)d\nu(\theta)$ over all distributions $\nu(\theta)$. The first term is like an expected utility: you want to put mass on parameters that yield a high likelihood. The second term regularizes: you want a small KL divergence to the prior. This formula explicitly says what the posterior is optimizing. It is used a lot in the context of quasi-likelihood, where people replace the log-likelihood by another utility function.
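
Here is the promised sketch of the first bullet (my own illustration): the Gaussian-mean model with "known" variance is fitted to Student-t data, and the sandwich standard error for the point estimate (essentially the sample mean here) is compared with the naive model-based one that the posterior standard deviation converges to.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 1.0                       # variance "known" to the (wrong) model
x = rng.standard_t(3, size=2000)   # true data: Student-t with variance 3
n = len(x)

# Naive model-based standard error, which the posterior sd matches asymptotically.
se_model = np.sqrt(sigma2 / n)

# Sandwich variance A^{-1} B A^{-1} / n for the score s_i(theta) = (x_i - theta)/sigma2:
# A = 1/sigma2 and B = Var(x)/sigma2^2, so the sandwich variance is Var(x)/n.
A = 1.0 / sigma2
B = x.var(ddof=1) / sigma2**2
se_sandwich = np.sqrt(B / (A**2 * n))

print("model-based se:", se_model)       # ~0.022, too small under misspecification
print("sandwich se:   ", se_sandwich)    # ~0.039, reflects the true data variance
```
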
gung - Reinstate Monica
Pierrot
8

There is the usual bias-variance tradeoff. Bayesian inference assuming the M-closed case [1,2] has a smaller variance [3], but in the case of model misspecification the bias grows faster [4]. It is also possible to do Bayesian inference assuming the M-open case [1,2], which has a higher variance [3] but, in the case of model misspecification, a smaller bias [4]. Discussions of this bias-variance tradeoff between the Bayesian M-closed and M-open cases also appear in some of the references below, but there is clearly a need for more.

[1] Bernardo and Smith (1994). Bayesian Theory. John Wiley & Sons.

[2] Vehtari and Ojanen (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6:142-228. http://dx.doi.org/10.1214/12-SS102

[3] Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711-735. http://dx.doi.org/10.1007/s11222-016-9649-y.

[4] Yao, Vehtari, Simpson, and Gelman (2017). Using stacking to average Bayesian predictive distributions. arXiv preprint arXiv:1704.02030. https://arxiv.org/abs/1704.02030

Aki Vehtari
6

assume that the real model of the data $p_{true}(X)$ differs from $p(X|\theta)$ for all values of $\theta$

The Bayesian interpretation of this assumption is that there is an additional random variable $\phi$ and a value $\phi_0$ in its range such that $\int p(X|\theta,\phi=\phi_0) \mathrm{d}\theta =0$. Your prior knowledge says $p(\phi=\phi_0)\propto 1$ and $p(\phi\neq\phi_0)=0$. Then $p(\theta|X,\phi=\phi_0)=0$, which is not a proper probability distribution.

This case corresponds to a similar inference rule in logic where $A, \neg A \vdash \emptyset$, i.e. you can't infer anything from a contradiction. The result $p(\theta|X,\phi=\phi_0)=0$ is a way in which bayesian probability theory tells you that your prior knowledge is not consistent with your data. If someone failed to get this result in their derivation of the posterior, it means that the formulation failed to encode all relevant prior knowledge. As for the appraisal of this situation I hand over to Jaynes (2003, p.41):

... it is a powerful analytical tool which can search out a set of propositions and detect a contradiction in them if one exists. The principle is that probabilities conditional on contradictory premises do not exist (the hypothesis space is reduced to the empty set). Therefore, put our robot to work; i.e. write a computer program to calculate probabilities $p(B|E)$ conditional on a set of propositions $E = (E_1,E_2,\dots,E_n)$. Even though no contradiction is apparent from inspection, if there is a contradiction hidden in $E$, the computer program will crash. We discovered this "empirically," and after some thought realized that it is not a reason for dismay, but rather a valuable diagnostic tool that warns us of unforeseen special cases in which our formulation of a problem can break down.

In other words, if your problem formulation is inaccurate - if your model is wrong, bayesian statistics can help you find out that this is the case and can help you to find what aspect of the model is the source of the problem.

In practice, it may not be entirely clear what knowledge is relevant and whether it should be included in the derivation. Various model checking techniques (Chapters 6 & 7 in Gelman et al., 2013, provide an overview) are then used to detect and identify an inaccurate problem formulation.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis, Third edition. Chapman & Hall/CRC.

Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge university press.

Aki Vehtari
matus
  • 1
    Your answer is missing the point and considering a simpler situation. I do not consider a situation in which our model is so wrong that it is inconsistent with the data. I look at a situation in which our model is wrong, but not catastrophically so. For example, consider infering the mean of the $X_i$. You could use a Gaussian model of the $X_i$ for inference, even though the real model is Laplace. In this simple example, the model is wrong but won't "explode" like what you describe. – Guillaume Dehaene Apr 21 '17 at 07:05
  • 1
    @GuillaumeDehaene Your question was whether there are some arguments for using Bayes when the model is misspecified. Clearly, a catastrophically misspecified model is misspecified. In addition, you can't know a priori whether your model is catastrophically misspecified or just misspecified. In fact Bayes can tell you precisely that, which makes it useful, and my answer pointed that out. – matus Apr 21 '17 at 13:47
  • If it's not catastrophically wrong, then the coverage won't be so different from $1-\alpha$. You could write a simulation of this normal model with Laplacian data to check this. The conceptual benefits would always be present. Think about it: if you decide to throw your posterior out of the window, you wouldn't compute just the MLE, but also some confidence interval. But we know that the interpretation of the CI computed for ONE particular experiment is rubbish. So relax and enjoy the Bayesian beer. If you understand that the model is misspecified, use this information to build a better one. – Zen Apr 21 '17 at 13:50
  • @GuillaumeDehaene Yes, my answer is not exhaustive. I gladly extend it to clarify not catastrophic cases, but You need to specify what You have in mind: do You mean that $\int p(X,\theta|\phi=\phi_0) \mathrm{d}\theta =k$ where $k$ is some small number so that $p(X|\phi=\phi_0)$ is small? Or are You saying that there exists $\theta=\theta_0$ such that $p(\theta=\theta_0|\phi=\phi_0)=0$ yet $p(X,\theta=\theta_k|\phi=\phi_0)>0$ or something else? I agree with Zen that generally the posterior won't be affected much in these less severe cases, although one could construct a borderline case. – matus Apr 21 '17 at 14:03
5

The MLE is still an estimator for a parameter in a model you specify and assume to be correct. The regression coefficients in a frequentist OLS can be estimated with the MLE and all the properties you want to attach to it (unbiased, a specific asymptotic variance) still assume your very specific linear model is correct.

I'm going to take this a step further and say that every time you want to ascribe meaning and properties to an estimator you have to assume a model. Even when you take a simple sample mean, you are assuming the data is exchangeable and oftentimes IID.

Now, Bayesian estimators have many desirable properties that an MLE might not have: for example, partial pooling, regularization, and the interpretability of a posterior, which make them desirable in many situations.
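
As a minimal sketch of the regularization point (a standard conjugate identity, shown here with made-up data): the posterior mean of regression coefficients under a zero-mean Gaussian prior equals the ridge estimate with penalty $\lambda = \sigma^2 / \tau^2$, which a bare MLE does not give you.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(0, 1.0, size=n)

sigma2, tau2 = 1.0, 0.5   # noise variance and prior variance, both assumed known here

# Posterior mean under the N(0, tau2 I) prior on the coefficients...
post_mean = np.linalg.solve(X.T @ X / sigma2 + np.eye(p) / tau2, X.T @ y / sigma2)

# ...equals the ridge estimate with penalty lambda = sigma2 / tau2.
lam = sigma2 / tau2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(post_mean, ridge))   # True
```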

Zen
TrynnaDoStat
4

I recommend Gelman & Shalizi's Philosophy and the practice of Bayesian statistics. They have coherent, detailed and practical responses to these questions.

We think most of this received view of Bayesian inference is wrong. Bayesian methods are no more inductive than any other mode of statistical inference. Bayesian data analysis is much better understood from a hypothetico-deductive perspective. Implicit in the best Bayesian practice is a stance that has much in common with the error-statistical approach of Mayo (1996), despite the latter’s frequentist orientation. Indeed, crucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense.

We proceed by a combination of examining concrete cases of Bayesian data analysis in empirical social science research, and theoretical results on the consistency and convergence of Bayesian updating. Social-scientific data analysis is especially salient for our purposes because there is general agreement that, in this domain, all models in use are wrong – not merely falsifiable, but actually false. With enough data – and often only a fairly moderate amount – any analyst could reject any model now in use to any desired level of confidence. Model fitting is nonetheless a valuable activity, and indeed the crux of data analysis. To understand why this is so, we need to examine how models are built, fitted, used and checked, and the effects of misspecification on models.

...

In our view, the account of the last paragraph [of the standard Bayesian view] is crucially mistaken. The data-analysis process – Bayesian or otherwise – does not end with calculating parameter estimates or posterior distributions. Rather, the model can then be checked, by comparing the implications of the fitted model to the empirical evidence. One asks questions such as whether simulations from the fitted model resemble the original data, whether the fitted model is consistent with other data not used in the fitting of the model, and whether variables that the model says are noise (‘error terms’) in fact display readily-detectable patterns. Discrepancies between the model and data can be used to learn about the ways in which the model is inadequate for the scientific purposes at hand, and thus to motivate expansions and changes to the model (Section 4.).

2

I think you're describing an impact of model uncertainty - you worry that your inference about an unknown parameter $x$ in light of data $d$ is conditional upon a model, $m$, $$ p (x|d, m), $$ as well as the data. What if $m$ is an implausible model? If there exist alternative models with the same unknown parameter $x$, then you can marginalize over model uncertainty with Bayesian model averaging, $$ p (x|d) = \sum_m p (x|d, m)\, p(m|d), $$ though this is a functional of the models considered and their priors.

If, on the other hand, the definition of the parameter $x$ is intrinsically tied to the model $m$, such that there are no alternatives, it's hardly surprising that inferences about $x$ are conditional on $m$.
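
A minimal numerical sketch of that averaging (my own illustration with arbitrary numbers): two conjugate Gaussian-mean models that differ only in their assumed noise standard deviation, combined via their marginal likelihoods computed on a grid.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(5)
x = rng.standard_t(3, size=40)          # data (deliberately from neither model)

theta = np.linspace(-3, 3, 2001)        # grid over the shared unknown mean
dtheta = theta[1] - theta[0]
log_prior_theta = stats.norm.logpdf(theta, 0.0, 1.0)

# Two candidate models: Gaussian likelihoods with different assumed noise sd,
# given equal prior probability p(m) = 1/2; both may of course be wrong.
models = {"sigma=1": 1.0, "sigma=2": 2.0}
log_ev, post = {}, {}
for name, sigma in models.items():
    loglik = stats.norm.logpdf(x[:, None], loc=theta[None, :], scale=sigma).sum(axis=0)
    log_joint = loglik + log_prior_theta
    log_ev[name] = logsumexp(log_joint) + np.log(dtheta)   # log p(d | m) on the grid
    post[name] = np.exp(log_joint - log_ev[name])          # p(theta | d, m)

# Posterior model probabilities p(m | d) and the model-averaged posterior p(theta | d).
w = np.exp(np.array(list(log_ev.values())) - logsumexp(list(log_ev.values())))
bma = sum(wi * post[name] for wi, name in zip(w, models))
print(dict(zip(models, np.round(w, 3))))
print("model-averaged posterior mean:", np.sum(bma * theta) * dtheta)
```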

innisfree
  • 3
    Model averaging can't save us: it is still foolish to assume that the true model somehow neatly falls inside the scope of our bigger model. With model comparison, we can determine which of several models gives the best account of the data, but this just returns a wrong model that is less wrong than the other models. – Guillaume Dehaene Apr 21 '17 at 07:07
  • It can help you make inferences/estimates about an unknown quantity that coherently incorporate model uncertainty. It cannot invent new hypotheses for you, though. If there were a statistical machinery that invented models in light of data, e.g. science would be much easier. – innisfree Apr 21 '17 at 14:48
1

How do you define what a "mis-specified" model is? Does this mean the model...

  • makes "bad" predictions?
  • is not of the form $p_{T}(x) $ for some "true model"?
  • is missing a parameter?
  • leads to "bad" conclusions?

If you think of the ways a given model could be mis-specified, you will essentially be extracting information on how to make a better model. Include that extra information in your model!

If you think about what a "model" is in the Bayesian framework, you can always make a model that cannot be mis-specified. One way to do this is by adding more parameters to your current model. By adding more parameters, you make your model more flexible and adaptable. Machine Learning methods make full use of this idea. This underlies things like "neural networks" and "regression trees". You do need to think about priors though (similar to regularising for ML).

For example, you have given the "linear model" as your example, so you have...
$$\text {model 1: }x_i =\theta + \sigma e_i $$ Where $e_i \sim N (0,1)$. Now suppose we add a new parameter for each observation.... $$\text {model 2: }x_i =\theta + \sigma \frac{e_i}{w_i} $$
Where $e_i \sim N (0,1)$ as before. How does this change things? You could say "model 1 is mis-specified if model 2 is true". But model 2 is harder to estimate, as it has many more parameters. Also, if information about $\theta $ is what we care about, does it matter if model 1 is "wrong"?

If you assume that $w_i\sim N (0,1) $ (like a "model 2a") then we basically have "Cauchy errors" instead of "normal errors", and the model expects outliers in the data. Hence, by adding parameters to your model, and choosing a prior for them, I have created a "more robust model". However, the model still expects symmetry in the error terms. By choosing a different prior, this could be accounted for as well...
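
A quick numerical check of the "Cauchy errors" claim (my own sketch): with independent $e_i, w_i \sim N(0,1)$, the ratio $e_i/w_i$ is a standard Cauchy variable, so model 2a is a linear model with (scaled) Cauchy errors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
e = rng.standard_normal(100_000)
w = rng.standard_normal(100_000)
ratio = e / w                      # the error term e_i / w_i of "model 2a"

# Compare quantiles of the simulated ratio with the standard Cauchy distribution.
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.quantile(ratio, qs))
print(stats.cauchy.ppf(qs))        # essentially the same values
```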

probabilityislogic
  • And the more parameters you use, the more data you need. If the information in $x$ about $f(x)$ is scarce, then adding parameters will not help. With new data, the DGP is even less constant, so you again need more parameters and so forth. The more general your model (more parameters), the less likely it is "mis-specified", but the more data you need to estimate. In contrast, the less you ask of your model, the less data you need. But that means in reality, how "right" is likely the model if a full posterior versus, say, a conditional moment? – IMA Oct 21 '19 at 11:57