26

Popular topic models like LDA usually cluster words that tend to co-occur into the same topic (cluster).

What is the main difference between such topic models and other simple co-occurrence-based clustering approaches like PMI? (PMI stands for Pointwise Mutual Information, and it is used to identify the words that co-occur with a given word.)

Momo
kanzen_master

3 Answers

31

Recently, a huge body of literature on how to extract information from written text has grown. Hence, I will just describe four milestones/popular models and their advantages/disadvantages, and thus highlight (some of) what I think are the main/most important differences.

You mention the "easiest" approach, which would be to cluster the documents by matching them against a predefined query of terms (as in PMI). These lexical matching methods, however, can be inaccurate due to polysemy (a single term having multiple meanings) and synonymy (multiple terms having similar meanings).
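For illustration, here is a minimal sketch of the PMI idea the question refers to, using a toy corpus of my own (the variable names and documents are hypothetical, not from the question):

```python
# Minimal PMI sketch: PMI(x, y) = log P(x, y) / (P(x) * P(y)),
# estimated here from document-level co-occurrence counts (toy example).
import math
from collections import Counter
from itertools import combinations

docs = [["topic", "model", "lda"],
        ["lda", "dirichlet", "prior"],
        ["topic", "model", "dirichlet"]]

word_counts = Counter(w for doc in docs for w in set(doc))
pair_counts = Counter(frozenset(p) for doc in docs for p in combinations(set(doc), 2))
n_docs = len(docs)

def pmi(x, y):
    p_xy = pair_counts[frozenset((x, y))] / n_docs
    p_x, p_y = word_counts[x] / n_docs, word_counts[y] / n_docs
    return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

print(pmi("topic", "model"))  # positive PMI: the two terms co-occur more than chance
```

Note that this only measures direct (first-order) co-occurrence and treats each surface form as one unit, which is exactly where polysemy and synonymy cause trouble.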

As a remedy, latent semantic indexing (LSI) tries to overcome this by mapping terms and documents into a latent semantic space via a singular value decomposition. The LSI results are more robust indicators of meaning than individual terms would be. However, one drawback of LSI is that it lacks a solid probabilistic foundation.
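As a rough sketch (not the canonical LSI implementation; the toy corpus and parameter choices are mine), LSI can be approximated with a truncated SVD of a TF-IDF term-document matrix:

```python
# LSI sketch: map documents into a low-dimensional latent semantic space via truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "a kitten sat on the rug",
        "stocks fell on the market today",
        "the market rallied as shares rose"]

tfidf = TfidfVectorizer().fit_transform(docs)        # documents x terms
lsi = TruncatedSVD(n_components=2, random_state=0)   # keep 2 latent dimensions
doc_vectors = lsi.fit_transform(tfidf)               # documents in the latent space
print(doc_vectors.shape)                             # (4, 2)
```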

This was partly solved by the invention of probabilistic LSI (pLSI). In pLSI models, each word in a document is drawn from a mixture model specified via multinomial random variables (which also allows higher-order co-occurrences, as @sviatoslav hong mentioned). This was an important step forward in probabilistic text modeling, but it was incomplete in the sense that it offers no probabilistic structure at the level of documents.
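To make the mixture idea concrete, here is a minimal EM sketch for pLSI on a small document-term count matrix. This is my own toy implementation for illustration, not Hofmann's original algorithm or code:

```python
# pLSI sketch: P(d, w) = P(d) * sum_z P(z|d) P(w|z), fitted with EM on raw counts.
import numpy as np

rng = np.random.default_rng(0)

def plsi(counts, n_topics=2, n_iter=50):
    n_docs, n_words = counts.shape
    # Random initialisation of P(w|z) and P(z|d), normalised to proper distributions.
    p_w_given_z = rng.random((n_topics, n_words))
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = rng.random((n_docs, n_topics))
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w) ∝ P(z|d) P(w|z).
        joint = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]
        resp = joint / joint.sum(axis=1, keepdims=True)

        # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w) * P(z|d,w).
        weighted = counts[:, None, :] * resp
        p_w_given_z = weighted.sum(axis=0)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
        p_z_given_d = weighted.sum(axis=2)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

    return p_w_given_z, p_z_given_d

# Toy usage: 3 documents over a 4-word vocabulary.
counts = np.array([[3, 1, 0, 0],
                   [2, 2, 0, 1],
                   [0, 0, 4, 2]])
p_w_given_z, p_z_given_d = plsi(counts, n_topics=2)
```

Note that P(z|d) is estimated separately for each training document, which is exactly the "no probabilistic structure at the level of documents" issue mentioned above.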

Latent Dirichlet Allocation (LDA) alleviates this and was the first fully probabilistic model for text clustering. Blei et al. (2003) show that pLSI is a maximum a posteriori estimated LDA model under a uniform Dirichlet prior.

Note that the models mentioned above (LSI, pLSI, LDA) have in common that they are based on the “bag-of-words” assumption, i.e. that words within a document are exchangeable, so the order of words in a document can be neglected. This assumption of exchangeability offers a further justification for LDA over the other approaches: assuming that not only words within documents but also the documents themselves are exchangeable (i.e. the order of documents within a corpus can also be neglected), de Finetti's theorem states that any set of exchangeable random variables has a representation as a mixture distribution. Thus, if exchangeability of documents and of words within documents is assumed, a mixture model for both is needed. Exactly this is what LDA generally achieves, but PMI or LSI does not (and even pLSI does not achieve it as elegantly as LDA).
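For completeness, here is a short sketch of fitting LDA with scikit-learn; the corpus and parameter choices are arbitrary and purely for illustration:

```python
# LDA sketch: each document becomes a mixture over topics,
# each topic a (multinomial) distribution over terms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "a kitten sat on the rug",
        "stocks fell on the market today",
        "the market rallied as shares rose"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)                      # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

doc_topics = lda.transform(counts)                    # per-document topic distributions
topic_words = lda.components_                         # per-topic word weights (unnormalised)
print(doc_topics.round(2))
```

Because LDA is fully generative, `transform` can also be applied to unseen documents, which is harder to justify for pLSI.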

Momo
  • 1/2 Thanks! Very clear. Let me check if I got this right: In LSI, documents are formed by a mixture of words (no notion of topics), and words and documents are mapped to a lower-dimensional semantic space using SVD. Since words with similar semantic meaning are mapped closer together, it can deal with synonymy but has problems with polysemy. pLSI solves the polysemy problem by introducing the concept of topics. In pLSI, words are drawn from a multinomial distribution of words (topics), the same word can belong to several topics, and a document has multiple topics, although this is not modeled explicitly. – kanzen_master Jul 16 '12 at 06:21
  • 2/2 pLSI has several problems: a large number of parameters, which leads to overfitting, and it cannot tell the probability of a new document belonging to the corpus, since it only learns a combination of topics for each training document. LDA extends pLSI by modeling documents as mixtures of topics drawn from a Dirichlet prior. Since LDA learns a distribution over the different topics, the model can be used to cluster unseen documents into learned distributions. It also has fewer issues with overfitting. – kanzen_master Jul 16 '12 at 06:22
  • I think you generally got it right. Some smaller corrections: LSI is considered to work OK with both polysemy and synonymy. pLSI is basically a formulation that achieves what LSI strives for with the tools of latent class analysis/mixture models and probability, rather than just linear algebra. LDA, compared to pLSI, is a fully generative model by specifying a per-document topic distribution. – Momo Jul 16 '12 at 10:02
  • Regarding your points on overfitting and prediction, I am not knowledgeable enough for a qualified statement. But, for what it's worth, I don't see why LDA should be less prone to overfitting than pLSI (as LDA basically just adds a prior to a pLSI model). Neither has a built-in correction for overfitting or the like. "Prediction" of new documents might indeed be easier or feasible with a fully generative model like LDA, see http://stats.stackexchange.com/questions/9315/topic-prediction-using-latent-dirichlet-allocation?rq=1 But I would see LDA as an unsupervised, descriptive model. – Momo Jul 16 '12 at 10:12
  • Thanks again! Just two final questions: (1) Regarding polysemy, [in this pdf, end of page 3](http://boston.lti.cs.cmu.edu/callan/Workshops/lmir01/WorkshopProcs/Papers/hofmann.pdf) Hofmann states that one of the differences of pLSI compared to LSI is polysemy, since the same word can belong to different word distributions (topics); that is why I thought that LSI did not work with polysemy. (2) Regarding overfitting, this [blog](http://ammai2012.blogspot.jp/2012_05_01_archive.html) states that a linear increase in parameters suggests that the model is prone to overfitting. What do you think? – kanzen_master Jul 16 '12 at 11:37
  • No problem. You know much about these things already, so I learn stuff too. Ad (1): Well, as usual, it depends. LSI can handle polysemy to some extent due to the linear combination of terms, as done in PCA. It does this better for synonyms, but to a certain degree also for polysemy: polysemous words that are similar are added as components of words that share a similar meaning. However, it does this much less well than pLSI, as each occurrence of a word is represented as a single point in space. The word representation is therefore an average of all the word's different meanings in the corpus. – Momo Jul 16 '12 at 15:06
  • Often, a word has one predominant meaning in a corpus, so the averaging is not that big a problem. (2) The link to the blog doesn't work for me, but I'd say that both models, pLSI and LDA (the former being a special case of the latter), can easily be overfit as there is no overfitting penalty. Just think of specifying 1000 topics for a small corpus. But I don't see why LDA should overfit less than pLSI. – Momo Jul 16 '12 at 15:17
6

I might be 3 years late, but I want to follow up on your question about an example of "higher-order co-occurrences".

Basically, if term t1 co-occurs with term t2, and t2 co-occurs with term t3, then t1 is a second-order co-occurrence of t3. You can go to higher orders if you want, but in the end you control how similar two words should be.
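A small sketch of that idea (my own toy example): if C is a first-order co-occurrence matrix, then C @ C reveals second-order co-occurrences, i.e. terms connected through a shared neighbour:

```python
# Second-order co-occurrence sketch: t1 and t3 never co-occur directly,
# but both co-occur with t2, so they are second-order co-occurrences.
import numpy as np

terms = ["t1", "t2", "t3"]
# First-order co-occurrence counts (symmetric, zero diagonal):
C = np.array([[0, 1, 0],     # t1 co-occurs with t2 only
              [1, 0, 1],     # t2 co-occurs with t1 and t3
              [0, 1, 0]])    # t3 co-occurs with t2 only

second_order = C @ C
print(second_order[0, 2])    # 1 -> t1 and t3 are linked via t2
```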

suthee
6

LDA can capture higher-order co-occurrences of terms (because each topic is assumed to be a multinomial distribution over terms), which is not possible by just computing PMI between terms.

Liangjie Hong