Multinomial-Dirichlet model with hyperprior distribution on the concentration parameters

Question

I will try to describe the problem at hand as generally as possible. I am modeling observations with a categorical distribution whose parameter is a probability vector theta.

Then, I assume the parameter vector theta follows a Dirichlet prior distribution with parameters $\alpha_1,\alpha_2,\ldots,\alpha_k$.

Is it then possible to also impose a hyperprior distribution over the parameters $\alpha_1,\alpha_2,\ldots,\alpha_k$? Will it have to be a multivariate distribution such as the categorical and Dirichlet distributions? It seems to me the alphas are always positive, so a gamma hyperprior should work.

I am not sure if anyone has tried fitting such (possibly) overparametrized models, but it seems reasonable to me that the alphas should not be fixed but rather come from a gamma distribution.

Please point me to some references or insights on how I could try such an approach in practice.

Yes, this is possible and it has been done. In general this is called a Bayesian hierarchical model. Preferably, this prior should account for possible dependencies. — , Nov 21 '12 at 20:45
@Procrastinator Thanks. Do you have a reference for good Bayesian hierarchical models of this kind? — Dnaiel, Nov 21 '12 at 20:49
@Procrastinator: Have you managed to find any papers, reports, or ideally hands-on application documents on Bayesian hierarchical models? — Zhubarb, Nov 07 '13 at 13:09

jerad · Accepted Answer · 2012-11-29T19:10:55.047

I don't think this is an "overparameterized" model at all. I would argue that by placing a prior over the Dirichlet parameters, you're being less committal about any particular outcome. In particular, as you probably know, for symmetric Dirichlet distributions (i.e. $\alpha_1 = \alpha_2 = \dots = \alpha_K = \alpha$), setting $\alpha<1$ gives more prior probability to sparse multinomial distributions, while $\alpha>1$ gives more prior probability to smooth multinomial distributions.
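The sparse-versus-smooth contrast can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names, the choice of $K = 4$, and the values $\alpha = 0.1$ and $\alpha = 10$ are illustrative assumptions, not part of the original answer. It draws Dirichlet vectors by normalizing independent gamma draws and compares the average size of the largest component.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector by normalizing independent Gamma(alpha_k, 1) draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(draws)
        if total > 0.0:  # retry if every draw underflowed to zero (tiny alphas)
            return [d / total for d in draws]

def mean_largest_component(alpha, k, n_draws, rng):
    """Average of max(theta) over draws from a symmetric Dirichlet(alpha, ..., alpha)."""
    total = 0.0
    for _ in range(n_draws):
        total += max(sample_dirichlet([alpha] * k, rng))
    return total / n_draws

rng = random.Random(0)
sparse_peak = mean_largest_component(0.1, 4, 2000, rng)   # alpha < 1: peaky draws
smooth_peak = mean_largest_component(10.0, 4, 2000, rng)  # alpha > 1: flat draws
print(f"alpha = 0.1: mean largest component = {sparse_peak:.2f}")
print(f"alpha = 10:  mean largest component = {smooth_peak:.2f}")
```

With $\alpha < 1$ the largest component should sit close to 1 (most mass on one category), while with $\alpha > 1$ it should stay near $1/K$.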

In cases where one has no strong expectation for either sparse or dense multinomial distributions, placing a hyperprior over your Dirichlet distribution gives your model some added flexibility to choose between them.

I originally got the idea of doing this from this paper. The hyperprior they use is slightly different from what you suggest: they sample a probability vector from a Dirichlet and then scale it by a draw from an exponential (or gamma). So the model is

\begin{eqnarray}
\beta &\sim& \mathrm{Dirichlet}(1)\\
\lambda &\sim& \mathrm{Exponential}(\cdot)\\
\theta &\sim& \mathrm{Dirichlet}(\beta\lambda)
\end{eqnarray}

The extra Dirichlet is simply to avoid imposing symmetry.
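To make the three-level structure concrete, here is a rough sketch of forward sampling from this model with the Python standard library. The exponential's parameter is left unspecified in the answer, so the `rate` argument, its value of 0.1, and the function names are assumptions made only for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector by normalizing independent Gamma(alpha_k, 1) draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(draws)
        if total > 0.0:  # retry if every draw underflowed to zero (tiny alphas)
            return [d / total for d in draws]

def sample_theta(k, rate, rng):
    """One forward draw of (beta, lambda, theta) from the scaled-Dirichlet hyperprior."""
    beta = sample_dirichlet([1.0] * k, rng)   # beta ~ Dirichlet(1, ..., 1): asymmetric base shape
    lam = rng.expovariate(rate)               # lambda ~ Exponential(rate): overall concentration
    theta = sample_dirichlet([b * lam for b in beta], rng)  # theta ~ Dirichlet(beta * lambda)
    return beta, lam, theta

rng = random.Random(42)
beta, lam, theta = sample_theta(5, 0.1, rng)  # rate 0.1 is an assumed value (mean lambda = 10)
print("beta  :", [round(b, 3) for b in beta])
print("lambda:", round(lam, 3))
print("theta :", [round(t, 3) for t in theta])
```

Here `beta` sets the relative sizes of the concentration parameters and `lam` sets their total, so small draws of `lam` favor sparse `theta` and large draws favor `theta` close to `beta`.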

I've also seen people use just the gamma hyperprior for a Dirichlet in the context of hidden Markov models with multinomial emission distributions, but I can't seem to find a reference. Also, it seems like I've encountered similar hyperpriors used in topic models.

Thanks, great answer! I have one short follow-up question: will this model allow for different variability for each of the thetas? I ask because the parameter lambda is shared across all thetas, so they all share the same scaling parameter, and I was wondering whether the model would provide such flexibility in the case of overdispersion. Your intuition/knowledge here is greatly appreciated! Thanks! — Dnaiel, Nov 30 '12 at 15:21
@Dnaiel, tell me if I'm misunderstanding your question, but yes: even with a symmetric Dirichlet prior, say $Dirichlet(0.2, 0.2, 0.2, 0.2)$, draws from that distribution will tend to produce sparse $\theta$ vectors. By sparse I mean that if you were to plot the vector $\theta$ as a histogram it would be very peaky, rather than flat. In the model above the Dirichlet parameters are not symmetric, because the $\beta$ parameter is drawn from a Dirichlet hyperprior. — jerad, Nov 30 '12 at 15:55

score 5 · Answer 2 · answered Oct 04 '16 at 19:33

To demonstrate a solution to this hyperprior problem, I implemented a hierarchical gamma-Dirichlet-multinomial model in PyMC3. The gamma prior for the Dirichlet is specified and sampled per Ted Dunning's blog post.

The model I implemented can be found at this Gist but is also described below:

This is a Bayesian hierarchical (pooling) model for movie ratings. Each movie can be rated on a scale from zero to five. Each movie is rated several times. We want to find a smoothed distribution of ratings for each movie.

We are going to learn a top-level prior distribution (hyperprior) on movie ratings from the data. Each movie will then have its own prior that is smoothed by this top-level prior. Another way of thinking about this is that the prior for ratings for each movie will be shrunk towards the group-level, or pooled, distribution.

If a movie has an atypical rating distribution, this approach will shrink the ratings toward something more in line with what is expected. Furthermore, this learned prior can be useful to bootstrap movies with few ratings, so that they can be meaningfully compared to movies with many ratings.
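The shrinkage effect can be illustrated with a small sketch. It assumes the pooled prior is already known and fixed, which is a simplification: the model in this answer learns it from the data. Under that assumption, the smoothed distribution is the standard Dirichlet posterior mean, (count + pseudo-count) / (total count + total pseudo-count). The function name and all numbers below are hypothetical.

```python
def smoothed_rating_distribution(counts, prior_pseudo_counts):
    """Posterior mean of rating probabilities under a fixed Dirichlet prior."""
    total = sum(counts) + sum(prior_pseudo_counts)
    return [(n + a) / total for n, a in zip(counts, prior_pseudo_counts)]

# Hypothetical pooled prior over ratings 0..5, worth 12 pseudo-ratings in total.
prior = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0]

few_ratings = [0, 0, 0, 0, 0, 2]      # new movie: two ratings, both 5
many_ratings = [0, 0, 0, 0, 0, 200]   # established movie: 200 ratings, all 5

smoothed_few = smoothed_rating_distribution(few_ratings, prior)
smoothed_many = smoothed_rating_distribution(many_ratings, prior)
print("P(rating = 5), few ratings :", round(smoothed_few[5], 3))
print("P(rating = 5), many ratings:", round(smoothed_many[5], 3))
```

The movie with two ratings stays close to the pooled prior, while the movie with 200 ratings is dominated by its own data.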

The model is as follows:

$\gamma_{k=1...K} \sim Gamma(\alpha, \beta)$

$\theta_{m=1...M} \sim Dirichlet_M(c\gamma_1, ..., c\gamma_K)$

$z_{m=1...M,n=1...N_m} \sim Categorical_M(\theta_m)$

where:

$K$: number of movie rating levels (e.g. $K = 6$ implies ratings 0, ..., 5)
$M$: number of movies being rated
$N_m$: number of ratings for movie $m$
$\alpha = 1 / K$: chosen in order to make the collection of gamma random variables act as an exponential coefficient
$\beta$: rate parameter for the exponential top-level prior
$c$: concentration parameter dictating the strength of the top-level prior
$\gamma_k$: top-level prior for rating level $k$
$\theta_m$: movie-level prior for rating levels (multivariate with dimension = $K$)
$z_{mn}$: rating $n$ for movie $m$
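As a rough illustration of the generative process defined above, the sketch below simulates ratings from the model using only the Python standard library. It is not the PyMC3 implementation from the Gist and it does no inference. The function names and the values chosen for `beta`, `c`, and the per-movie rating counts are assumptions for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector by normalizing independent Gamma(alpha_k, 1) draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(draws)
        if total > 0.0:  # retry if every draw underflowed to zero (tiny alphas)
            return [d / total for d in draws]

def simulate_ratings(n_ratings_per_movie, k=6, beta=1.0, c=10.0, seed=0):
    """Forward-simulate gamma_k, theta_m and ratings z_mn from the hierarchical model."""
    rng = random.Random(seed)
    alpha = 1.0 / k
    # gamma_k ~ Gamma(shape=alpha, rate=beta); gammavariate expects a scale, i.e. 1 / rate.
    gamma = [rng.gammavariate(alpha, 1.0 / beta) for _ in range(k)]
    ratings = []
    for n_m in n_ratings_per_movie:
        # theta_m ~ Dirichlet(c * gamma_1, ..., c * gamma_K)
        theta_m = sample_dirichlet([c * g for g in gamma], rng)
        # z_mn ~ Categorical(theta_m), with rating levels 0, ..., K - 1
        ratings.append(rng.choices(range(k), weights=theta_m, k=n_m))
    return gamma, ratings

gamma, ratings = simulate_ratings([3, 50, 200])
print("gamma:", [round(g, 4) for g in gamma])
for m, z_m in enumerate(ratings):
    print(f"movie {m}: {len(z_m)} ratings, first few = {z_m[:5]}")
```

Because every movie shares the same `gamma`, the simulated movies tend to have similar rating profiles, and `c` controls how tightly each `theta_m` follows that shared profile.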

score 1 · Answer 3 · answered Nov 30 '12 at 14:56

This is direct Bayesian conjugate-prior modeling, a natural extension of the Beta-Binomial model. A good resource for this could be the book. The posterior is also Dirichlet, and hence simulating from the Dirichlet will give the necessary summaries.
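For reference, the conjugate update mentioned here can be written out for fixed $\alpha$. The count notation $n_k$, the number of observations falling in category $k$, is introduced only for this illustration:

\begin{eqnarray}
\theta &\sim& \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)\\
\theta \mid z &\sim& \mathrm{Dirichlet}(\alpha_1 + n_1, \ldots, \alpha_K + n_K)
\end{eqnarray}

This closed form holds conditional on $\alpha$. Once $\alpha$ itself is given a hyperprior, only the conditional posterior of $\theta$ given $\alpha$ remains Dirichlet.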

Subbiah

Thanks. I am familiar with that book; it is a great reference. I tried looking into it, but it does not provide such a multinomial hierarchical model directly, though it does have tons of good ideas that can be applied. — Dnaiel, Nov 30 '12 at 15:26
The Dirichlet-multinomial is a conjugate model, but the OP asked about a (hyper-)prior on the parameters of the Dirichlet. There's no standard conjugate prior for the Dirichlet distribution, although one must [in fact exist](http://andrewgelman.com/2009/04/conjugate_prior/), as it's a member of the exponential family. — jerad, Dec 01 '12 at 03:41
