In the original LDA paper it is stated that:
> The parameters for a $k$-topic pLSI model are $k$ multinomial distributions of size $V$ and $M$ mixtures over the $k$ hidden topics. This gives $kV + kM$ parameters and therefore linear growth in $M$. The linear growth in parameters suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem.
Also:
> LDA is a well-defined generative model and generalizes easily to new documents. Furthermore, the $k + kV$ parameters in a $k$-topic LDA model do not grow with the size of the training corpus.
But as I understand it, LDA also has those $kV + kM$ parameters; they are simply not treated as hyperparameters. So this seems irrelevant to overfitting. That is, in pLSA the following distributions must be estimated ($M$ is the number of documents):
$p(z|d): kM$ parameters,
$p(w|z): kV$ parameters,
and in LDA the following posteriors have to be estimated:
the posterior over $\Theta_d$ (the per-document topic proportions, with Dirichlet prior $\alpha$): $kM$ parameters, since each $\Theta_d$ is $k$-dimensional,
$p(w|z): kV$ parameters,
and two additional parameters, $\alpha$ and $\eta$ (called hyperparameters).
Thus, the number of posteriors to be estimated is approximately the same. Why, then, is LDA claimed to have solved the overfitting problem of pLSA? I agree that a Dirichlet distribution with a low $\alpha$ tends to generate sparser distributions than a Dirichlet with $\alpha = 1$ (i.e., uniform), as in pLSA, and that this sparsity might help reduce overfitting a bit, but the number of parameters is still similar.
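To make the counting in my argument concrete, here is a small sketch (the function names and the exact bookkeeping are my own; the hyperparameters $\alpha$ and $\eta$ add only a constant, so I omit them from the last count):

```python
def plsa_param_count(k: int, V: int, M: int) -> int:
    """pLSA free (maximum-likelihood) parameters: p(w|z) and p(z|d)."""
    return k * V + k * M          # grows linearly with the corpus size M

def lda_model_param_count(k: int, V: int) -> int:
    """LDA parameters as counted in the paper (k + kV): topic-word
    distributions plus the Dirichlet hyperparameter; independent of M."""
    return k * V + k

def lda_inference_quantity_count(k: int, V: int, M: int) -> int:
    """Quantities actually estimated during LDA inference: topic-word
    distributions plus one k-dimensional posterior over Theta_d per document."""
    return k * V + k * M

if __name__ == "__main__":
    k, V = 100, 10_000
    for M in (1_000, 10_000, 100_000):
        print(M,
              plsa_param_count(k, V, M),
              lda_model_param_count(k, V),
              lda_inference_quantity_count(k, V, M))
```

As the sketch shows, only the count the paper reports for LDA stays fixed; the per-document quantities one actually estimates still scale as $kM$, which is exactly what puzzles me.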