
There is a kind of simulation study that is commonly used to validate an implementation of a Bayesian model:

  • For each independent replication $i = 1, \ldots, n$:
    1. Draw a set of "true" parameters from the joint prior.
    2. Draw a dataset from the likelihood given the parameter draws from (1).
    3. Approximate the full joint posterior distribution, e.g. with MCMC or variational inference.
    4. For each parameter (index $p$), let $c_{ip} = 1$ if the $100(1 - \alpha)\%$ posterior interval covers the corresponding "true" parameter value drawn in (1), and $c_{ip} = 0$ otherwise.
  • For each parameter $p$, calculate coverage: $C_p = \frac{1}{n} \sum_{i = 1}^n c_{ip}$. If $C_p$ falls below $1 - \alpha$ by more than binomial Monte Carlo error would explain, then there is a problem in the model or the software. A minimal code sketch of the whole procedure follows.
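
Here is a minimal, runnable sketch of the procedure (Python/NumPy), assuming a toy conjugate normal-normal model with known data variance so that exact posterior draws can stand in for MCMC or variational output; all variable names and settings are illustrative, not tied to any particular library:

```python
# Toy sketch of the procedure above: conjugate normal-normal model with
# known data variance, so exact posterior draws stand in for MCMC output.
import numpy as np

rng = np.random.default_rng(0)

n_reps = 1000          # number of independent replications
n_obs = 20             # observations per simulated dataset
alpha = 0.05           # nominal 95% posterior intervals
mu0, tau0 = 0.0, 1.0   # prior: mu ~ Normal(mu0, tau0^2)
sigma = 1.0            # known likelihood standard deviation

covered = np.zeros(n_reps, dtype=bool)
for i in range(n_reps):
    # 1. Draw the "true" parameter from the prior.
    mu_true = rng.normal(mu0, tau0)
    # 2. Draw a dataset from the likelihood given that parameter.
    y = rng.normal(mu_true, sigma, size=n_obs)
    # 3. "Approximate" the posterior; it is conjugate here, so exact
    #    posterior draws replace what MCMC would normally produce.
    post_prec = 1.0 / tau0**2 + n_obs / sigma**2
    post_mean = (mu0 / tau0**2 + y.sum() / sigma**2) / post_prec
    post_draws = rng.normal(post_mean, np.sqrt(1.0 / post_prec), size=4000)
    # 4. Does the central 100(1 - alpha)% interval cover the true value?
    lo, hi = np.quantile(post_draws, [alpha / 2, 1 - alpha / 2])
    covered[i] = lo <= mu_true <= hi

coverage = covered.mean()                            # C_p for the one parameter mu
mc_se = np.sqrt(coverage * (1 - coverage) / n_reps)  # binomial Monte Carlo error
print(f"coverage: {coverage:.3f} (target {1 - alpha}, MC s.e. about {mc_se:.3f})")
```

With a correct model and correct code, `coverage` should land within Monte Carlo error of $1 - \alpha$; a value well below that is the failure signal described in step (4).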

This technique is super useful in my team's work, and it has caught a lot of errors. Does anyone know if it has an established name? I have been searching but have been unable to find one. At first I thought it was called "simulation-based calibration", but what I am describing performs the coverage check in step (4) above instead of the calibration step.

References

  • Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020. “Bayesian Workflow.” https://arxiv.org/abs/2011.01808.

  • Cook, Samantha R., Andrew Gelman, and Donald B. Rubin. 2006. “Validation of Software for Bayesian Models Using Posterior Quantiles.” Journal of Computational and Graphical Statistics 15 (3): 675–92. http://www.jstor.org/stable/27594203.

  • Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2020. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” http://arxiv.org/abs/1804.06788.

landau
    Have you considered "posterior predictive checks"? – svendvn Mar 09 '21 at 23:54
  • I often do when feasible, but this particular simulation does not use the posterior predictive distribution (only the marginal posterior of each parameter). “Posterior predictive checks” and “posterior checks” sound a bit too general for this. – landau Mar 09 '21 at 23:57
  • Also, I would like to find the name that is already widely used in the community, rather than try to invent a name myself – landau Mar 09 '21 at 23:59
  • Never heard of it. – Xi'an Mar 10 '21 at 07:12
  • Whilst I have heard of people investigating the frequentist properties of credible regions, CRs offer no coverage guarantees. And so I can’t see why you’d conclude a bug in software or model if it didn’t have a particular coverage. – innisfree Mar 10 '21 at 15:43
  • @innisfree I see your point about frequentism. However, it isn't all that different from actual SBC. http://www.jstor.org/stable/27594203 section 2 paragraph 1 explicitly claims their quantile method generalizes what I described, and SBC in https://arxiv.org/abs/2011.01808 generalizes further. All 3 approaches take independent draws from the prior predictive distribution and approximate the posterior for each prior predictive draw. And all 3 approaches compare posterior quantiles to prior predictive draws from simulations. – landau Mar 11 '21 at 11:11
  • No. What they describe in Sec. 2 is more like an average coverage, averaged over the prior, which agrees with the amount of probability in the CR. – innisfree Mar 11 '21 at 15:30
  • It does not say that CR has correct coverage for every choice of possible parameter. – innisfree Mar 11 '21 at 15:31
  • Proof of the average property: just write the joint as $p(x,D|M) = p(x|D,M) p(D|M)$. This is the distribution from which we’re sampling true parameters and data. Then it is clear that in every simulation, for whatever $D$ you draw, since $p(x|D,M)$ is the posterior, there’s an X% chance you get a draw that lies in the X% CR. – innisfree Mar 11 '21 at 15:34
  • Actually, re-reading, I now find your question ambiguous: do you test for correct average coverage? Or correct coverage for each possible true value? – innisfree Mar 11 '21 at 15:37
  • I calculate coverage as an average over all prior predictive draws. For any individual prior predictive draw on its own, all I calculate is hit or miss. In other words, I was thinking of coverage itself as an average over the prior. (If I understand correctly, the issue you just raised seems like what section 2 of https://arxiv.org/pdf/1804.06788.pdf is talking about.) – landau Mar 11 '21 at 15:44
  • Thinking out loud: if a rank statistic in SBC is itself a kind of coverage, the aggregate does seem to be a kind of "average coverage". – landau Mar 11 '21 at 15:54
  • Just edited the original post to try to be clearer. – landau Mar 11 '21 at 19:33
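
For reference, the average-coverage argument sketched in the comments above can be written out explicitly (a minimal derivation in the comments' notation: $x$ for the parameters, $D$ for the data, $M$ for the model, and $\mathrm{CR}_{1-\alpha}(D)$ for a $100(1-\alpha)\%$ credible region computed from the correct posterior):

$$
\Pr\bigl(x \in \mathrm{CR}_{1-\alpha}(D)\bigr)
= \int \Pr\bigl(x \in \mathrm{CR}_{1-\alpha}(D) \mid D, M\bigr)\, p(D \mid M)\, \mathrm{d}D
= \int (1-\alpha)\, p(D \mid M)\, \mathrm{d}D
= 1 - \alpha,
$$

since the joint factorizes as $p(x, D \mid M) = p(x \mid D, M)\, p(D \mid M)$, so conditional on each simulated $D$ the "true" parameter is itself a draw from the posterior, which places probability $1 - \alpha$ in the credible region. The coverage $C_p$ in the question therefore targets $1 - \alpha$ on average over the prior, not for every fixed true parameter value.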

0 Answers