What is the intuition behind assuming a variable satisfies a distribution?

Question

I'm new to the machine learning field, so please forgive me if this question is trivial.

In a famous paper, M. Hoffman, D. Blei, and P. Cook, "Bayesian nonparametric matrix factorization for recorded music", ICML 2010 (PDF: link, page 2, left column), the authors assume that the elements of the latent submatrices $W$ and $H$ follow gamma distributions:

$W_{ml} \sim \mathrm{Gamma}(a, a), \quad H_{ln} \sim \mathrm{Gamma}(b, b)$

Why should we assume these latent variables are drawn from a gamma distribution? What is the reason the authors chose the gamma distribution over any other?

My guess is to make for a very weakly informative prior. The gamma distribution can take on *many* shapes. This means that by learning the hyperparameters $a$ and $b$, the resulting matrix $\mathbf{X}$ can take on many different values. Note that you don't *have* to assume a gamma distribution, but instead the authors are trying to convey that using Gamma Process Nonnegative Matrix Factorization *implies* assuming gamma distributed submatrices. — Frans Rodenburg, Oct 31 '17 at 03:27
@FransRodenburg Thanks for your reply, this does make sense. BTW, is it somehow related to the usage of generalized inverse Gaussian as the prior during the variational inference? — ice_lin, Oct 31 '17 at 12:06

score 2 · Accepted Answer · answered Nov 01 '17 at 14:58

In general, distributions in these sorts of graphical models are chosen for support that suits the data, along with simplifying assumptions for computational convenience. Here, the authors say this much:

Unlike other BNP factorization methods, our model is not composed of conjugate pairs of distributions—we chose our distributions to be appropriate for spectrogram data, not for computational convenience.

The gamma has support on the positive reals, making it a reasonable choice for the nonnegative factors $\textbf{W}$ and $\textbf{H}$ (see the opening paragraphs of section 2 of the paper). As to why they chose the gamma over other distributions with the same support, say the lognormal, I'm unsure. (For the differences between the two, see answers to this question.) But the GIG later used in variational inference contains the gamma as a special case, and the authors say this:

Since both $y$ and $1/y$ are sufficient statistics of $GIG(y; \gamma, \rho, \tau )$, this will not pose a problem during inference, as it would if we were to use variational distributions from the gamma family.

Meaning, using GIG allows for convenient updates in the inference algorithm when applied to the $p(\textbf{X}|\textbf{W}, \textbf{H}, \theta)$ term. I'd wager that choosing gamma over lognormal also makes for much cleaner math when deriving the updates for $\mathcal{L}$; confirmation left as an exercise.
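To make the support argument concrete, here is a minimal sketch (not the authors' code) that draws $\textbf{W}$ and $\textbf{H}$ from the priors quoted in the question. It assumes $\mathrm{Gamma}(a, a)$ is read as shape $a$ and rate $a$; the matrix sizes, the values of `a` and `b`, and the helper name `sample_gamma_factor` are hypothetical, picked only for illustration.

```python
import numpy as np


def sample_gamma_factor(shape_rate, size, rng):
    """Draw entries from Gamma(a, a), read here as shape a and rate a.

    NumPy parameterizes the gamma by shape and scale, so rate a
    corresponds to scale 1 / a. This reading is an assumption.
    """
    return rng.gamma(shape=shape_rate, scale=1.0 / shape_rate, size=size)


rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters, for illustration only.
M, L, N = 2000, 5, 300
a, b = 0.5, 2.0

W = sample_gamma_factor(a, (M, L), rng)  # M x L, strictly positive
H = sample_gamma_factor(b, (L, N), rng)  # L x N, strictly positive
WH = W @ H                               # M x N, hence also positive

print("min of W, H, WH:", W.min(), H.min(), WH.min())
print("mean of W, H:", W.mean(), H.mean())
print("variance of W, H:", W.var(), H.var())
```

Under this parameterization every entry has prior mean $a/a = 1$ and variance $a/a^2 = 1/a$, so the single hyperparameter only controls how tightly the entries concentrate around 1. Positivity of the product follows from positivity of the factors, which is the property a lognormal prior would share.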
