
A frequent question that arises when I present results of topic modeling to novices is: "How many documents belong to topic x?".

As topic modeling is probabilistic, I hesitate to give absolute frequencies as an answer. But "mean document-topic-probability" is not something people inexperienced in topic modeling can work with.

How would you determine the number of documents per topic? Using a cut-off in document-topic-probability?

abitter
  • Since I had some situations where a definite number of documents per topic was requested (e.g., share of female authors per topic), I ended up with the following approach: **Select all docs with document-topic-probability > .5**. Although I agree with @haitao-du, doing it the Bayesian way (mean document-topic-probability of women vs. men) is sometimes hard to communicate. So selecting docs that _mainly_ address the topic might be a workaround for those situations, of course accompanied by stressing the resulting loss of information. – abitter Jul 26 '21 at 11:06

1 Answer


I think this question is about membership assignment and has little to do with topic modeling itself.

Assigning membership based on probabilities is a frequently asked question. See this question for an example:

To maximize the chance of correctly guessing the result of a coin flip, should I always choose the most probable outcome?

We can assign membership using maximum a posteriori (MAP) estimation, i.e. pick the most probable topic, or using some other rule.

Think about two examples:

  • Suppose there are only $2$ topics. Then we can simply set the threshold to $0.5$.
  • Suppose there are $100$ topics, and the probability distribution for a given document is $[0.1, 0.9/99, \cdots, 0.9/99]$. How do we assign the membership? If we assign the document to topic 1 (the MAP estimate), we will be wrong with probability $0.9$, i.e. 90% of the time!
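To make the second example concrete, here is a minimal sketch in Python with NumPy. The variable names are just for illustration; the numbers are the ones from the example above.

```python
import numpy as np

n_topics = 100

# Topic 1 gets probability 0.1; the remaining 0.9 is spread evenly
# over the other 99 topics, as in the example above.
probs = np.full(n_topics, 0.9 / 99)
probs[0] = 0.1

# MAP assignment: pick the single most probable topic.
map_topic = int(np.argmax(probs))  # index 0 corresponds to "topic 1"

# Probability that the document does NOT belong to the MAP topic.
p_map_wrong = 1.0 - probs[map_topic]

print(f"MAP topic index: {map_topic}")
print(f"P(MAP assignment is wrong) = {p_map_wrong:.2f}")
```

Topic 1 is more than ten times as likely as any other single topic, yet the hard assignment is still wrong most of the time.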

So the answer is to just use the probabilities (the Bayesian way) and not to use a threshold to assign membership.
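One way to still give novices a "number of documents per topic" without a hard cut-off is to report the expected count, i.e. the sum of the document-topic probabilities for each topic (equivalently, the mean document-topic probability times the number of documents). The sketch below is my own illustration, not a standard recipe: the matrix `theta` is made-up toy data, and in practice it would come from whatever topic model you fitted. The count from the $> 0.5$ cut-off is shown alongside for comparison.

```python
import numpy as np

# Hypothetical document-topic matrix: rows are documents, columns are topics,
# and each row sums to 1. These values are invented for illustration only.
theta = np.array([
    [0.70, 0.20, 0.10],
    [0.40, 0.35, 0.25],
    [0.10, 0.45, 0.45],
    [0.34, 0.33, 0.33],
])

# Expected number of documents per topic: sum the probabilities column-wise.
# These fractional counts add up to the total number of documents.
expected_docs = theta.sum(axis=0)

# Hard assignment with a cut-off: count documents with probability > 0.5.
thresholded_docs = (theta > 0.5).sum(axis=0)

print("expected counts:   ", expected_docs)
print("thresholded counts:", thresholded_docs)
```

In this toy data the cut-off leaves three of the four documents unassigned, while the expected counts keep every document's full probability mass.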

Haitao Du