
I am new to topic modeling and have read about LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization). I understand how the training process works. Let's say I have 100 documents and I want to train an LDA model on these documents with 10 topics. However, I don't really understand how this model assigns a topic distribution to an unseen document.

I used Gensim. After training, I have a trained LDA model and a dictionary of the most frequent words. Let's say I have an unseen new document with the following text:

This is just a test text about topic modeling and LDA. 

Can someone explain, step by step, how a topic distribution is assigned to this new document in terms of algorithmic steps? The same goes for the NMF method.
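For concreteness, here is roughly what my setup looks like (a minimal sketch; variable names and parameter values are just illustrative, not my exact code):

    from gensim import corpora, models

    # tokenized_docs: my 100 training documents, each a list of tokens
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    # train an LDA model with 10 topics
    lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=10)

    # the unseen document, tokenized and mapped through the trained dictionary
    new_bow = dictionary.doc2bow(
        "this is just a test text about topic modeling and lda".split())

    # this returns a topic distribution such as [(2, 0.71), (5, 0.18), ...],
    # and it is this step that I don't understand algorithmically
    print(lda.get_document_topics(new_bow))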

nickg
  • From the context, I understand that LDA refers to Latent Dirichlet Allocation, but please clarify this in the question. Also include the full name for Non-negative Matrix Factorization. – Daniel López Jan 29 '18 at 14:37
  • The Bayes decision rule of assigning topics to new documents depends on the loss function. – Łukasz Grad Jan 29 '18 at 14:49
  • LDA does not assign topics to documents, it assigns topics to words and topic-distributions to documents. – guy Jan 29 '18 at 15:23
  • @guy I should have explicitly specified that. I meant topic distribution. – nickg Jan 29 '18 at 15:25
  • The topic distribution is represented as a point on the $n_{topic}$-dimensional simplex, and is inferred by looking at the posterior under a Dirichlet prior. If we were to use, say, a Gibbs sampler, the topic distribution would be updated across iterations by sampling from the associated full conditional, which by conjugacy is another Dirichlet. – guy Jan 29 '18 at 17:23
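To make the conjugate update in the last comment concrete (the notation here is mine, not from the thread): suppose the model has $K$ topics, $\theta_d$ is the topic distribution of document $d$, $\alpha$ is a symmetric Dirichlet prior parameter, and $n_{d,k}$ counts the words of $d$ currently assigned to topic $k$ by the sampler. The full conditional the Gibbs sampler draws from is then

$$\theta_d \mid z_d \sim \mathrm{Dirichlet}(\alpha + n_{d,1}, \ldots, \alpha + n_{d,K}),$$

i.e. the prior pseudo-counts plus the document's current per-topic word counts, so each sweep only needs to update those counts before resampling $\theta_d$.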

1 Answer


What you should actually do is run inference (training) again on the full set of documents (the old ones and the new ones together). A shortcut that approximates this well is to apply Gibbs sampling only to the new documents while keeping the statistics learned during training fixed, as described by @SheldonCooper in Topic prediction using latent Dirichlet allocation.
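To make the shortcut concrete, here is a toy sketch of that folding-in step (my own illustration under the stated assumptions, not the code from the linked answer): the trained topic-word distributions are held fixed, and Gibbs sampling is run only over the word-topic assignments of the single new document.

    import numpy as np

    def fold_in_gibbs(word_ids, phi, alpha=0.1, n_iter=200, seed=0):
        """Estimate a topic distribution for one unseen document by Gibbs
        sampling its word-topic assignments while the trained topic-word
        matrix `phi` (shape: n_topics x vocab_size) stays fixed."""
        rng = np.random.default_rng(seed)
        n_topics = phi.shape[0]
        # random initial topic assignment for every word token
        z = rng.integers(n_topics, size=len(word_ids))
        counts = np.bincount(z, minlength=n_topics).astype(float)

        for _ in range(n_iter):
            for i, w in enumerate(word_ids):
                counts[z[i]] -= 1                 # remove token i from the counts
                p = phi[:, w] * (alpha + counts)  # unnormalized full conditional
                z[i] = rng.choice(n_topics, p=p / p.sum())
                counts[z[i]] += 1                 # add token i back

        # posterior mean of the document's topic distribution
        return (alpha + counts) / (alpha * n_topics + counts.sum())

With a trained Gensim model, phi could be taken from lda.get_topics() and word_ids built by looking the new document's tokens up in the trained dictionary (out-of-vocabulary tokens are simply dropped). Note that Gensim's own LdaModel actually folds new documents in with variational inference rather than Gibbs sampling, but the principle is the same: the trained topics stay fixed and only the new document's topic proportions are inferred.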

emem