
I am writing a pymc3-based implementation of Latent Dirichlet Allocation, and am referencing this CrossValidated answer (modified for pymc3) as well as pymc3's own tutorial on LDA, in addition to the Wikipedia article on LDA.

But I am hitting a problem conceptually.

In the linked CrossValidated answer above, the final distribution for all of the word-position-specific Categorical (multinomial) outputs looks like this:

[
    # one observed Categorical variable per (document d, word position i);
    # its probability vector is the topic-specific word distribution phi[z[d][i]]
    pm.Categorical(
        "w_%i_%i" % (d, i),
        p=pm.Lambda(
            'phi_z_%i_%i' % (d, i),
            lambda z=z[d][i], phi=phi: phi[z]
        ),
        value=data[d][i],    # the observed value for this (document, position) pair
        observed=True
    )
    for d in range(D) for i in range(Wd[d])
]

where z and phi are defined just as in the Wikipedia formulation of LDA, data[d] is a vector of word counts (bag of words vector) for document number d, D is the number of documents, and Wd[d] is the document length of document d.

You don't need to dig around for what every variable name means. The problem I have is that the value parameter is set to data[d][i] for this observed=True random variable -- meaning whatever is in data[d][i] will define how the likelihood is penalized for matching the observed value, in turn affecting the gradient for choosing proposals for samples to draw, etc.

There are two approaches you could take to feeding the data into the model. The first, which seems standard in both the sklearn LDA and the pymc3 'official' tutorial linked above, is to use a standard bag-of-words representation, so that data[d][i] would be the count of word i appearing in document d.

But the nested for loop of the model above shows that there is a Categorical parameter for each (document, position) pair, and not for each (document, word index) pair.

In fact, if a given document happened to have more word positions in it than there are total words in the vocabulary, then the indexing data[d][i] might even cause an indexing error, because the data are counts for each word, and are not indexed by the position within the document.

This would be more in line with the second way of feeding in data: as a sequence of word index values, where the sequence has an entry for each position in a document, so its length is the length of the document.
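To make the contrast concrete, here is a hypothetical toy example (the names vocab, doc_words, bow, and word_seq are mine, purely for illustration):

import numpy as np

vocab = ["apple", "banana", "cherry"]                 # V = 3
doc_words = ["banana", "apple", "banana", "cherry"]   # one document, N_d = 4 positions

# Format 1: bag-of-words counts, one entry per vocabulary word (length V)
bow = np.zeros(len(vocab), dtype=int)
for w in doc_words:
    bow[vocab.index(w)] += 1
# bow -> array([1, 2, 1])

# Format 2: sequence of word indices, one entry per word position (length N_d)
word_seq = [vocab.index(w) for w in doc_words]
# word_seq -> [1, 0, 1, 2]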

In more mathematical terms, this is the set of distributional relationships defining the graphical model for LDA:

$$\boldsymbol\phi_{k=1 \dots K} \sim \operatorname{Dirichlet}_V(\boldsymbol\beta)\\ \boldsymbol\theta_{d=1 \dots M} \sim \operatorname{Dirichlet}_K(\boldsymbol\alpha)\\ z_{d=1 \dots M,w=1 \dots N_d} \sim \operatorname{Categorical}_K(\boldsymbol\theta_d) \\ w_{d=1 \dots M,w=1 \dots N_d} \sim \operatorname{Categorical}_V(\boldsymbol\phi_{z_{dw}}) $$

and the parameter $N_{d}$ is the length of document $d$. So in this setting, the parameters clearly should correspond to all of the word positions, and not to the word counts.

Over in the linked pymc3 tutorial, a custom log-likelihood function is written that accounts for this by multiplying each word's log likelihood (from the topic-conditional terms) by its count.
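As a rough sketch of that idea (my own numpy paraphrase, not the tutorial's actual code): the per-document log likelihood weights each vocabulary word's log probability under the document's topic mixture by its count, which is equivalent to summing one Categorical log term per word position.

import numpy as np

def doc_log_likelihood(theta_d, phi, counts_d):
    # theta_d : (K,)   topic mixture for document d
    # phi     : (K, V) per-topic word distributions
    # counts_d: (V,)   bag-of-words counts for document d
    word_probs = theta_d @ phi              # (V,) marginal probability of each vocab word
    # multiplying by the count replays the same log term once per occurrence
    return np.sum(counts_d * np.log(word_probs))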

But then this makes it seem that pymc3's tutorial conflicts with the graphical model from Wikipedia and from the linked CrossValidated answer: those both assume that the output prediction is a Categorical selection of a word index for each position in each document.

I hope it's clear what the conflict is. It looks to me like a problem where you "can't have it both ways."

You cannot both input the data in a bag-of-words format (where count values are used as the observed values in the likelihood function), and also have a generative model for each (document, position) word location.

You could either use count data as the input and then generate predictions about the occurrence of each word in the vocabulary as output (like unscaled bag-of-words vectors)...

Or, you could provide the observed word sequences (not counts), and then have a generative model that matches the Wikipedia and CrossValidated links, predicting a word for each (document, index) position.

If my either-or statement is correct, then does this mean the code from the linked CrossValidated answer is erroneous when it uses value=data[d][i] (the word count of vocabulary position i in document d) as the observational data?

Under that CrossValidated code, shouldn't the observed value parameter be the word index for the word that actually occurs in document d at position i? Is feeding bag-of-words data into that particular observational model simply wrong?

ely
  • The difference seems to me to be no more than the difference between a Binomial distribution where you have $n, \sum x_i$ as the data and a Binomial distribution where you have $x_1, \dots, x_n$ as data. It's the same, just summarized differently. The key is noticing that the probability of each word at each position is the same regardless of the position and does not depend on what other words may have been seen at other positions. But I haven't spent enough time on it to be sure. – jbowman Jun 11 '18 at 03:20
  • I agree that the actual formula you end up with, assuming independence of the word order, would be the same. But critically, the *indexing* would not be the same and could in fact cause problems. If the likelihood function is meant to contain a term for each (document #, word position) pair, this might not be congruent with data that starts out in bag-of-words format, which would mean each document vector is a fixed size (the vocabulary size). In that case, how would $N_{d}$ factor in to the final $w_{d, w}$ likelihood terms? – ely Jun 11 '18 at 14:19
  • The best I can come up with on my own is that the $w_{d, w}$ likelihood term *will* contain terms for each (document #, word position) pair, but that the *count data* for that word will just be *repeated* for each term it appears in. As in, there will be a likelihood term for (document 41, word 37 from the start), $w_{41, 37}$. This is *not* the 37th word of the vocabulary, but the 37th word of document 41. I believe the likelihood *should be* just an indicator vector with a count of 1 for whatever true word appears at position 37 in document 41. This is what we observe. – ely Jun 11 '18 at 14:23
  • But it looks like in every formulation I can find, instead of an indicator variable with a count of 1 in the vector component corresponding to the true word's location in the vocabulary, the count is just inflated to be the bag-of-words count for that word in that document -- which will act like a multiplier on the Categorical probability for that term (even though the other counts were observed in other positions of the document). That's fair enough, but it could lead to a nasty indexing problem. – ely Jun 11 '18 at 14:25
  • Particularly, it looks to me like in the example pymc3 code above, the question I called out about using `data[d][i]` has to be wrong. I think instead, you have to use `data[d][vocab_index[original_document[d][i]]]`, where `vocab_index` is a hash table mapping a word to an index in the vocabulary, and `original_document[d]` is the actual sequence of words (not the counts as are stored in `data`). – ely Jun 11 '18 at 14:29
  • Otherwise, suppose I am incorrect, then why would `data[d][i]` (which is the *count* of vocabulary entry `i` inside document `d` because `data` is bag-of-words) be used for the likelihood expression for $w_{d, i}$? Since `i` can range up to `N_{d}` (the number of total words in document `d`), in theory then, `i` can be larger than the size of the vocabulary itself, and indexing `data[d][i]` could be out of range for indexing a bag-of-words vector. And regardless, the indexing would still be *wrong* even when guaranteed not out of range. Unless I am really missing something? – ely Jun 11 '18 at 14:31
  • Why would `[i]` be larger than the vocabulary? The range of `i` is in `1 to V`, **NOT** `1 to N_d`. That is the source of your confusion! Before you set up your topic model, you need to specify what `V` is, by finding the unique words in your document corpus. The topic distributions are V-dimensional discrete distributions. – kedarps Jun 11 '18 at 14:52
  • @kedarps Where are you seeing that the range of `i` is to `V` and not to `N_d`? Note that `i` corresponds to the subscript `w`, as in $w_{d, w}$. And this index goes up to `N_d`. If this index `i` went only to the vocabulary size `V`, it would mean the model is totally different than the set of generative distributions in the Wikipedia description. – ely Jun 11 '18 at 14:56
  • I anticipated this point in my original question, when I said, "You could either use count data as the input and then generate predictions about the occurrence of each word in the vocabulary as output (like unscaled bag-of-words vectors)...". That 'unscaled bag of words' would be the V-dimensional distribution over words for document `d`, but then it still wouldn't explain why the model defines $N_d$ *distinct* $w_{d, w}$ terms, each of which follows a distribution drawn from the latent topic model (the $z$ terms). If you drew `V` of the $w_{d, w}$ terms, I would agree, but the model is $N_d$. – ely Jun 11 '18 at 15:19
  • @ely I indeed misread your comments. I didn't realize that you were referring to `i` in the above code. Have you checked [this example](https://github.com/napsternxg/ipython-notebooks/blob/master/PyMC_LDA.ipynb) using PyMC? I haven't used PyMC, so I will try to use it and see if I can help you :) – kedarps Jun 11 '18 at 15:31
  • @kedarps The code of that linked notebook looks like it was copy/pasted from the LDA example I linked above (contained in another Cross Validated question), especially since they both use `pm.CompletedDirichlet` in the same exact ways, even though this has been deprecated in newer versions of pymc. That linked example still suffers the same problem. Note how the term they use `Wd[d]` is supposed to be the document length of document `d`, but since they already provide the data in bag of words form, it's artificially equal to the vocabulary size for all documents (which is incorrect). – ely Jun 11 '18 at 15:35
  • Also in the your linked notebook example / the linked Cross Validated example, note how the final `w` likelihood terms use `data[d][i]` as the `value` parameter, and set `observed=True`. In pymc, this means that term will be a likelihood term using the `value` argument as the observed value, for purposes of calculating the gradient of the log likelihood when doing MCMC proposals. But if `data[d][i]` is *count data* for vocabulary entry `i` inside document `d`, how could this be used as an observation for a `Categorical` random variable that has a topic-based distribution over all words? – ely Jun 11 '18 at 15:45
  • That is, the `value` parameter is a count value. But the random variable produces an index drawn from a `Categorical`. How can the count data be used as an observation of that? I assume it has to be a bug, and that pymc interprets a single integer given for `value` as *the index* of the Categorical variable to count as the observed index. So count data would here be incorrectly used as if it represented an index into the vocabulary. – ely Jun 11 '18 at 15:46
  • In fact, [these slides](https://conference.scipy.org/scipy2010/slides/lightning/dan_williams_pymc.pdf) seem to confirm what I am saying about the misuse of `Categorical`, see the part defining `exp_data` to be a sequence of the observed category labels. – ely Jun 11 '18 at 15:51

1 Answer


The LDA topic model indeed works on per-document word counts using the bag-of-words representation of each document. The bag-of-words representation ignores the word position in a document.

... the parameter $N_d$ is the length of document $d$. So in this setting, the parameters clearly should correspond to all of the word positions, and not to the word counts.

Your confusion is probably due to $N_d$ appearing in the graphical model formulation. $$ z_{d=1 \dots M,\, w=1 \dots N_d} \sim \operatorname{Categorical}_K(\boldsymbol\theta_d) \\ w_{d=1 \dots M,\, w=1 \dots N_d} \sim \operatorname{Categorical}_V(\boldsymbol\phi_{z_{d,w}}) $$

The above two statements indicate the generative process of the LDA model. For each word ($w$) in a document ($d$), you first draw a topic from the distribution $\mathbf{\theta}_d$:

$$ z_{d,w} \sim \operatorname{Categorical}_K(\boldsymbol\theta_d) $$

Now, $z_{d,w} \in \{1, 2, \dots, K\}$. Remember, there are $K$ topics, where each topic is a distribution over $V$ words. You need $z_{d,w}$ to select which topic to draw the word from. Once you select the topic, a word can be drawn from that topic,

$$ w_{d,w} \sim \operatorname{Categorical}_V(\boldsymbol\phi_{z_{d,w}}) $$

Since there are $M$ documents with $N_d$ words per document, you repeat the above procedure for each word in each document.
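A minimal numpy sketch of this generative story (toy sizes chosen only for illustration) makes clear that each document is generated as a length-$N_d$ sequence of word indices, not as a count vector:

import numpy as np

rng = np.random.default_rng(0)
K, V, M = 2, 5, 3                     # topics, vocabulary size, documents (toy values)
N = [4, 6, 5]                         # N_d: length of each document
alpha, beta = 0.5, 0.1                # Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)     # phi_k ~ Dirichlet_V(beta), shape (K, V)

docs = []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))    # theta_d ~ Dirichlet_K(alpha)
    words = []
    for i in range(N[d]):
        z_di = rng.choice(K, p=theta_d)           # z_{d,w} ~ Categorical_K(theta_d)
        w_di = rng.choice(V, p=phi[z_di])         # w_{d,w} ~ Categorical_V(phi_{z_{d,w}})
        words.append(w_di)
    docs.append(words)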

The Gibbs sampling procedure given by Griffiths et al. to infer which topic is associated with each word gives insight into this:

$$ P(z_i=k \mid \textbf{z}_{-i} , \textbf{w} ) \propto \frac{n^{(w_i)}_{-i,k}+\beta}{n^{(.)}_{-i,k}+V\beta} \times \frac{n^{(d)}_{-i,k}+\alpha}{n^{(d)}_{-i,.}+K\alpha} $$

$P(z_i=k \mid \textbf{z}_{-i} , \textbf{w} )$ refers to the probability of assigning topic $k$ to the $i^{th}$ word, given all other assignments. This depends on two probabilities:

  1. Probability of word $w_i$ in topic $k$
  2. Probability of topic $k$ in document $d$

These probabilities can be easily computed using the following counts:

  • $n^{(w_i)}_{-i,k}:$ number of times word $w_i$ was assigned to topic $k$
  • $n^{(.)}_{-i,k}:$ total number of words assigned to topic $k$
  • $n^{(d)}_{-i,k}:$ number of times topic $k$ was assigned in document $d$
  • $n^{(d)}_{-i,.}:$ total number of topics assigned in document $d$
  • $K:$ number of topics
  • $V:$ number of words in vocabulary
  • $\alpha, \beta:$ Dirichlet hyperparameters

You can see that calculating which topic is associated with a particular word does not depend on word position, only on word counts.
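For illustration, here is a minimal sketch of evaluating that conditional purely from count tables (the array names n_wk and n_dk are mine; the "-i" exclusion is assumed to have already been applied to the tables before the call):

import numpy as np

def topic_conditional(w_i, d, n_wk, n_dk, alpha, beta):
    # n_wk: (V, K) counts of word w assigned to topic k (current word excluded)
    # n_dk: (M, K) counts of topic k assigned in document d (current word excluded)
    V, K = n_wk.shape
    p_word_topic = (n_wk[w_i, :] + beta) / (n_wk.sum(axis=0) + V * beta)   # P(w_i | topic k)
    p_topic_doc = (n_dk[d, :] + alpha) / (n_dk[d, :].sum() + K * alpha)    # P(topic k | doc d)
    p = p_word_topic * p_topic_doc
    return p / p.sum()   # normalized distribution over the K topics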


About Implementation:

You can implement the above equation in multiple ways. I haven't looked at the implementation in PyMC, but I have worked with the authors' Gibbs sampling implementation in C. They provide it in this MATLAB toolbox.

kedarps
  • You mentioned, "Your confusion is probably due to $N_{d}$ appearing in the graphical model formulation" but I think if the model starts from word counts like you describe, then I am actually confused by what $N_{d}$ means in the final all-words categorical distribution. $w_{d=1 \dots M,w=1 \dots N_d} \sim \operatorname{Categorical}_V(\boldsymbol\phi_{z_{dw}})$ – ely Jun 11 '18 at 14:11
  • That $w_{d,w}$ term appears to be a distribution over words, so if we look at $w_{2, 27}$, it would be the distribution (over the vocabulary) for the 27th word from the start of document 2. That's what I'm having a conceptual difficulty with. If this is the likelihood term where we use observations, wouldn't this term require its observation to be the one-hot vector for whatever the true word was at position 27 in document 2? *NOT* whatever the count was for *word number 27 of the vocabulary* in document 2. So there seems to be a gap here still. – ely Jun 11 '18 at 14:15
  • See the [comment thread](https://stats.stackexchange.com/questions/349638/understanding-the-role-of-document-size-parameters-in-latent-dirichlet-allocatio#comment660921_349638) for hopefully a better clarification of the issue. – ely Jun 11 '18 at 14:31
  • I have edited the answer with more clarification about the generative process. __"That $w_{d,w}$ term appears to be a distribution over words, so if we look at $w_{2,27}$ it would be the distribution (over the vocabulary) for the 27th word from the start of document 2."__ -- This is false. $\phi_{k}$ is a distribution over words, not $w_{d,w}$. The $w^{th}$ word in the $d^{th}$ document is drawn from the $z_{d,w}^{th}$ topic distribution. – kedarps Jun 11 '18 at 14:45
  • I think you misread my comments. Your comment about $\phi_{k}$ seems like it agrees with what I wrote. The $z_{d, w}$ topic distribution is a *distribution over words*, and this is what's used. I understand completely it's generative. But you aren't addressing how the data is used as observation in the likelihood terms (the observed $w_{d, w}$) when fitting the model. That's the whole crux of the question. – ely Jun 11 '18 at 14:50
  • It seems to me he has answered that for the bag-of-words representation; it's explicit in the formulation he gives. For the list-of-words representation, you can just convert it to a bag-of-words by accumulating counts of the observed word frequencies. – jbowman Jun 11 '18 at 16:27
  • @jbowman Can you elaborate? How is the $w_{d,w}$ likelihood implicit in the formulation? There's still not a clear answer on what data is fed as observation for that? It can't be count data, as none of the generative distribution terms model count data directly. And it still leaves open the problem of having $N_d$ different generative terms but only $V$ different vocabulary entries to have count data for. – ely Jun 11 '18 at 16:58
  • @jbowman If you look at the log-likelihood model used in the other pymc [tutorial that I linked](http://docs.pymc.io/notebooks/lda-advi-aevb.html) you'll see there is a big difference. In that model, the count data itself is held out and reserved only to function as a multiplier in the log likelihood (an exponent in the plain likelihood), and the likelihood itself is constructed differently than described in this answer (see cell 23). – ely Jun 11 '18 at 17:03
  • OK, i'll look at that in more detail. Maybe I made an assumption while reading through it about how it works that wasn't warranted. – jbowman Jun 11 '18 at 17:33