In general, when performing posterior predictive checks, one estimates a posterior predictive p-value like so: $$p_B = \frac{1}{S}\sum_{s=1}^{S}\mathbb{1}\left(T(x^{\text{rep},s},\theta^{(s)}) \ge T(x,\theta^{(s)})\right)$$ for some test quantity $T(x,\theta)$, posterior draws $\theta^{(s)}$, and replicate datasets $x^{\text{rep},s}$ drawn from $p(x\mid\theta^{(s)})$.
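To make the Monte Carlo estimate concrete, here is a minimal Python/NumPy sketch of that sum (the function name `posterior_predictive_pvalue` and the helper arguments `simulate_rep` and `T` are my own illustration, not taken from any particular library):

```python
import numpy as np

def posterior_predictive_pvalue(x, theta_draws, simulate_rep, T):
    """Monte Carlo estimate of p_B.

    x            : observed dataset (1-d array)
    theta_draws  : sequence of S posterior draws of theta
    simulate_rep : function (theta, n) -> replicate dataset of size n
    T            : test quantity, a function of (data, theta)
    """
    n = len(x)
    exceed = np.empty(len(theta_draws), dtype=bool)
    for s, theta in enumerate(theta_draws):
        x_rep = simulate_rep(theta, n)              # draw x^{rep,s} ~ p(x | theta^{(s)})
        exceed[s] = T(x_rep, theta) >= T(x, theta)  # the indicator inside the sum
    return exceed.mean()                            # proportion of exceedances = p_B
```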
I understand that this method is meant only for probing the model for areas of weakness, and that using the data twice is a necessary evil. See here for more details. That's not my concern, however.
The whole idea is that we draw conclusions about our model from the outcome (ideally $p_B \approx 0.5$), and that outcome depends on the test quantity/statistic we choose. And while we know not to select a $T(X)$ that is a sufficient statistic for the model (like the mean or variance in an exponential family), who is to say that the $T(X)$ we do select tells us anything useful about the model?
For example, say my test statistic is $\min(X)$. A poor test statistic like that can only tell me that my model does not capture this one aspect of the data well, given the prior and likelihood I set up. I could then tweak my likelihood until the posterior predictive check looks better, but at that point I'd potentially be overfitting, and what's more, overfitting to a test statistic that I merely think is a good one.
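To see both points in one toy example (everything here is invented purely for illustration and reuses the hypothetical `posterior_predictive_pvalue` sketch above): fit a conjugate normal model with unit variance to data that actually has heavy tails. A check with $\min(X)$ comes out far from 0.5, telling me only that the lower tail is off, while a check with the sample mean, which is sufficient for this model, sits near 0.5 no matter how wrong the tails are.

```python
rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=100)         # data with heavier tails than the assumed model

# Conjugate model: x_i ~ N(mu, 1) with prior mu ~ N(0, 100)
n = len(x)
post_var = 1.0 / (1.0 / 100 + n)           # posterior variance of mu
post_mean = post_var * n * x.mean()        # posterior mean of mu
theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=4000)
sim = lambda mu, m: rng.normal(mu, 1.0, size=m)

p_min  = posterior_predictive_pvalue(x, theta_draws, sim, lambda d, t: d.min())   # typically close to 1
p_mean = posterior_predictive_pvalue(x, theta_draws, sim, lambda d, t: d.mean())  # typically close to 0.5
print(p_min, p_mean)
```

Either extreme ($p_B$ near 0 or near 1) flags a misfit in that one direction, which is exactly the limited kind of insight I'm questioning here.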
It seems to me a rather complex and convoluted approach to model diagnostics compared to cross-validation and sensitivity analysis. Outside of academia and tightly controlled settings with well-specified Bayesian models, do practitioners actually use posterior predictive checks? Does anyone have real-world use cases where this approach was helpful? I'm happy to use them in day-to-day analysis because I strongly believe in model checking, but this seems like a hassle for little insight gained.