
Kevin Murphy's book discusses a classical hierarchical Bayesian problem (which originally appeared in Johnson and Albert, 1999, p. 24):

Suppose that we are trying to estimate the cancer rate in $N$ cities. In each city, we sample a number of individuals $N_i$ and measure the number of people with cancer $x_i \sim \text{Bin}(N_i, \theta_i)$, where $\theta_i$ is the true cancer rate in the city.

We would like to estimate the $\theta_i$'s while allowing the data-poor cities to borrow statistical strength from data-rich cities.

To do so, he models $\theta_i \sim \text{Beta}(a,b)$ so that all cities share the same prior; the final model looks as follows:

$p(\mathcal{D}, \theta, \eta|N)=p(\eta)\prod\limits^N_{i=1}\text{Bin}(x_i|N_i, \theta_i)\text{Beta}(\theta_i|\eta)$

where $\eta = (a,b)$.

The crucial part of this model is, of course (to quote Murphy), "that we infer $\eta=(a,b)$ from the data, since if we just clamp it to a constant, the $\theta_i$ will be conditionally independent, and there will be no information flow between them".


I am trying to model this in PyMC, but as far as I understand, I need a prior for $a$ and $b$ (I believe this is $p(\eta)$ above).

What would be a good prior for this model?

In case it helps, the code as I have it now is:

import pymc as pm  # PyMC2

bins = dict()
ps = dict()
for i in range(N_cities):
    # per-city cancer rate; all cities share the same Beta(a, b) prior
    ps[i] = pm.Beta("p_{}".format(i), alpha=a, beta=b)
    # observed cancer counts out of N_trials[i] sampled individuals
    bins[i] = pm.Binomial("bin_{}".format(i), p=ps[i], n=N_trials[i],
                          value=N_yes[i], observed=True)

mcmc = pm.MCMC([bins, ps])
Amelio Vazquez-Reina

1 Answer


A similar problem is discussed in Gelman, Bayesian Data Analysis (2nd ed., p. 128; 3rd ed., p. 110). Gelman suggests the prior $p(a,b)\propto (a+b)^{-5/2}$, which effectively constrains the "prior sample size" $a+b$, so the beta hyperprior is unlikely to be highly informative on its own. (As $a+b$ grows, the variance of the beta distribution shrinks, and the prior carries more weight relative to the observed data in the posterior.) Additionally, this prior does not favor $a>b$ or $a<b$, so the distribution over pairs $(a,b)$ is inferred from all of the data together, as you would prefer in this problem.
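In PyMC2 (matching the code in the question), one way to encode this prior is as an extra log-density term using the potential decorator. This is only a sketch under some assumptions: the uniform upper bound of 1000 is a truncation I have chosen to keep the base hyperpriors proper, and is not part of Gelman's prior.

import numpy as np
import pymc as pm  # PyMC2

# vague base priors on the hyperparameters; the upper bound is an
# assumed truncation, not part of Gelman's prior
a = pm.Uniform("a", lower=0.0, upper=1000.0)
b = pm.Uniform("b", lower=0.0, upper=1000.0)

@pm.potential
def gelman_prior(a=a, b=b):
    # log of p(a, b) ∝ (a + b)^(-5/2), up to an additive constant
    return -2.5 * np.log(a + b)

These a and b define the hyperparameters your loop refers to; a, b, and gelman_prior should then be included in the list passed to pm.MCMC.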

Gelman also suggests reparameterizing the model in terms of the logit of the prior mean of $\theta$ and the log of the prior "sample size": instead of doing inference on $(a,b)$ directly, infer the transformed quantities $\text{logit}\left(\frac{a}{a+b}\right)$ and $\log(a+b)$. These transformed parameters live on the whole real plane, whereas $(a,b)$ must be strictly positive. The transformation also spreads out the posterior density, which makes contour plots of it more legible; I find this helpful.
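Here is a sketch of that reparameterization, again in PyMC2. The uniform bounds on the transformed scale are truncations I have assumed, and the potential expresses $p(a,b)\propto(a+b)^{-5/2}$ in the new coordinates, where the change of variables contributes a Jacobian factor of $ab$ (BDA3, Sec. 5.3).

import numpy as np
import pymc as pm  # PyMC2

# hyperparameters on the transformed scale; both live on the real line
# (the bounds are assumed truncations to keep the priors proper)
logit_mean = pm.Uniform("logit_mean", lower=-10.0, upper=10.0)  # logit(a/(a+b))
log_size = pm.Uniform("log_size", lower=-5.0, upper=10.0)       # log(a+b)

@pm.deterministic
def a(logit_mean=logit_mean, log_size=log_size):
    mean = 1.0 / (1.0 + np.exp(-logit_mean))  # inverse logit: a/(a+b)
    return mean * np.exp(log_size)

@pm.deterministic
def b(logit_mean=logit_mean, log_size=log_size):
    mean = 1.0 / (1.0 + np.exp(-logit_mean))
    return (1.0 - mean) * np.exp(log_size)

@pm.potential
def hyperprior(a=a, b=b):
    # p(a, b) ∝ (a + b)^(-5/2) written in the transformed coordinates;
    # the Jacobian of the change of variables contributes the factor a*b
    return np.log(a) + np.log(b) - 2.5 * np.log(a + b)

The per-city Beta and Binomial nodes from the question can then be attached to these a and b unchanged.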

Sycorax