Latent Dirichlet Allocation in PyMC

Question

As an exercise to improve my skills in PyMC (Python's Markov chain Monte Carlo library), I am trying to implement Latent Dirichlet Allocation as described here: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation.

The model can be compactly described as

$$\boldsymbol\phi_{k=1 \dots K} \sim \operatorname{Dirichlet}_V(\boldsymbol\beta)\\ \boldsymbol\theta_{d=1 \dots M} \sim \operatorname{Dirichlet}_K(\boldsymbol\alpha)\\ z_{d=1 \dots M,w=1 \dots N_d} \sim \operatorname{Categorical}_K(\boldsymbol\theta_d) \\ w_{d=1 \dots M,w=1 \dots N_d} \sim \operatorname{Categorical}_V(\boldsymbol\phi_{z_{dw}}) $$

I came up with the following toy code:

import numpy as np
import pymc as pm

K = 2 # number of topics
V = 4 # number of words
D = 3 # number of documents

data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])

alpha = np.ones(K)
beta = np.ones(V+1)

theta = pm.Container([pm.Dirichlet("theta_%s" % i, theta=alpha) for i in range(D)])
phi = pm.Container([pm.Dirichlet("phi_%s" % k, theta=beta) for k in range(K)])
Wd = [len(doc) for doc in data]

z = pm.Container([pm.Categorical('z_%i' % d, 
                             p = theta[d], 
                             size=Wd[d],
                             value=np.random.randint(K, size=Wd[d]),
                             verbose=1)
              for d in range(D)])


w = pm.Container([pm.Categorical("w_%i" % d,
                             p = pm.Lambda('phi_z_%i' % d, lambda z=z, phi=phi: [phi[z[d][i]] for i in range(Wd[d])]),
                             value=data[d], 
                             observed=True, 
                             verbose=1)
              for d in range(D)])

model = pm.Model([theta, phi, z, w])
mcmc = pm.MCMC(model)
mcmc.sample(100, burn=10)

The tricky part is within $w_{d=1 \dots M,w=1 \dots N_d} \sim \operatorname{Categorical}_V(\boldsymbol\phi_{z_{dw}})$. Given the output of the sampling, I must be doing something wrong, because the model does not converge and I get many warnings about the probabilities in categorical_like that do not sum to one.

Is there a PyMC expert around who can shed some light on this all?

A couple of simplifications: 1. You don't have to explicitly wrap lists of variables in Container objects; PyMC will do this automatically. 2. You don't have to create a Model object; just pass your variables (or easier, just the locals()) to MCMC directly. — fonnesbeck, Jul 18 '15 at 16:17
Please see extended discussion at [this question](https://stats.stackexchange.com/q/349638/8927) because there appear to be some fundamental errors in this formulation of LDA with pymc. — ely, Jun 11 '18 at 15:52

score 8 · Accepted Answer · answered Oct 17 '14 at 14:29

8

When defining w, the p parameter must be a list of doubles, not a list of lists of doubles. This means you have to define a w variable for each word in each document. Also it helps to 'complete' the Dirichlet variables using the CompletedDirichlet function. Here is the working code:

import numpy as np
import pymc as pm

K = 2 # number of topics
V = 4 # number of words
D = 3 # number of documents

data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])

alpha = np.ones(K)
beta = np.ones(V)

theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
Wd = [len(doc) for doc in data]

z = pm.Container([pm.Categorical('z_%i' % d, 
                     p = theta[d], 
                     size=Wd[d],
                     value=np.random.randint(K, size=Wd[d]))
                  for d in range(D)])

# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i),
                    p = pm.Lambda('phi_z_%i_%i' % (d,i), 
                              lambda z=z[d][i], phi=phi: phi[z]),
                    value=data[d][i], 
                    observed=True)
                  for d in range(D) for i in range(Wd[d])])

model = pm.Model([theta, phi, z, w])
mcmc = pm.MCMC(model)
mcmc.sample(100)

answered Oct 17 '14 at 14:29

Tom Minka

6,610
1
22
33

Hi, it seems that this simple example does not converge when using the default MCMC inferencer (without modifying step-functions). I have tried it with different sample (100, 1000, 500000), burn-in (0, 500, 5000) and thinning (0, 10) parameters as well as different priors for alpha and beta. But when looking at the traces of the topic-word-distributions "phi_0" and "phi_1", I don't seem to be able to get stable results. Do you have an idea why this might happen? – Martin Becker Dec 05 '14 at 14:19
This might be aided by using an AdaptiveMetropolis sampler, which can provide better mixing for correlated stochastics. – fonnesbeck Jul 18 '15 at 16:20
2

Thanks for this answer. I have tried to implement your solution and have done some preliminary analysis on the Inaugural Speech corpus. My analysis can be seen at: https://github.com/napsternxg/ipython-notebooks/blob/master/PyMC_LDA.ipynb – Shubhanshu Mishra Aug 03 '15 at 01:17
Please see extended discussion at [this question](https://stats.stackexchange.com/q/349638/8927) because there appear to be some fundamental errors in this formulation of LDA with pymc. Both with this answer, and also the above-linked notebook. The bag of words formulation of the data is in error in terms of how it connects to the LDA likelihood terms for generating words, and the bag of words count data is being improperly treated as a *category label* in the `Categorical` distribution for the `w` terms with `observed=True`. – ely Jun 11 '18 at 15:53
Could not found pm.Dirichlet – ZHU Aug 30 '18 at 18:34

Latent Dirichlet Allocation in PyMC

1 Answers1

Linked