It isn't really clear, but I think what he is doing is taking the words found under "predefined" topic tags on a discussion board and then up-weighting those words (by a factor of 1000) in the sampling process of LDA.
For example, if I search stats.stackexchange under the tag "natural-language", build a vocabulary of word : # of times the word appears, and remove stopwords (common words), I will probably get something like:
$$
\begin{align}
\text{nlp} &~|~ 10000 \\
\text{classify} &~|~ 9500 \\
\text{text} &~|~ 9273 \\
\text{deep} &~|~ 3000 \\
\text{modelling} &~|~ 324 \\
\text{lda} &~|~ 234 \\
\text{gibbs} &~|~ 230 \\
\end{align}
$$
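To make that concrete, here is a rough sketch of that counting step, assuming we have already pulled the post text for the tag into a list (the `posts` and the stopword set below are placeholders, not real data):

```python
# Sketch: count word frequencies in posts from one tag, dropping stopwords.
from collections import Counter
import re

posts = [
    "LDA is a topic modelling method often used for NLP text classification",
    "Gibbs sampling is one way to fit an LDA model to text",
]
stopwords = {"is", "a", "for", "to", "the", "one", "an", "of", "often", "used", "way"}

counts = Counter(
    tok
    for post in posts
    for tok in re.findall(r"[a-z]+", post.lower())
    if tok not in stopwords
)
for word, n in counts.most_common(10):
    print(f"{word:12s} | {n}")
```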
Alternatively, the predefined topic tag already has keywords associated with it (which he uses and weights more heavily). Sticking to our example, the "natural-language" tag mentions words like: linguistics, artificial, intelligence, machine, learning. We weight these words higher.
Then, in the sampling process, for any word $w_i$ with associated topic weight $b_{z,i}$ (where $z$ indexes the topic), we just multiply that weight by some constant (here 1000), i.e. $b_{z,i} \cdot 1000$, whenever $w_i$ is one of the keywords for topic $z$.
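As far as I can tell, that is the whole trick. Here is a minimal sketch of how it might look inside collapsed Gibbs sampling for LDA; the toy corpus, the seed lists, and the 1000x boost are my own assumptions for illustration, not taken from the article:

```python
# Sketch: collapsed Gibbs sampling for LDA with seed-word boosting.
import numpy as np

rng = np.random.default_rng(0)

docs = [["nlp", "text", "classify", "deep"],
        ["gibbs", "lda", "modelling", "text"],
        ["linguistics", "machine", "learning", "nlp"]]

vocab = sorted({w for d in docs for w in d})
w2id = {w: i for i, w in enumerate(vocab)}
V, K = len(vocab), 2
alpha, beta, boost = 0.1, 0.01, 1000.0

# Seed words per topic (assumption: taken from the predefined tag descriptions).
seeds = {0: {"linguistics", "machine", "learning"}, 1: {"lda", "gibbs"}}
boost_mat = np.ones((K, V))
for k, words in seeds.items():
    for w in words:
        if w in w2id:
            boost_mat[k, w2id[w]] = boost

# Count tables and random initial topic assignments.
ndk = np.zeros((len(docs), K))   # doc-topic counts
nkw = np.zeros((K, V))           # topic-word counts
nk = np.zeros(K)                 # topic totals
z = []                           # topic assignment per token
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        k = rng.integers(K)
        z[d].append(k)
        ndk[d, k] += 1
        nkw[k, w2id[w]] += 1
        nk[k] += 1

for it in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            wid, k = w2id[w], z[d][i]
            # Remove the current assignment.
            ndk[d, k] -= 1; nkw[k, wid] -= 1; nk[k] -= 1
            # Standard collapsed-Gibbs weights, multiplied by the seed boost.
            p = (ndk[d] + alpha) * (nkw[:, wid] + beta) / (nk + V * beta)
            p *= boost_mat[:, wid]
            p /= p.sum()
            k = rng.choice(K, p=p)
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, wid] += 1; nk[k] += 1

# Top words per topic after sampling.
for k in range(K):
    top = np.argsort(-nkw[k])[:3]
    print(f"topic {k}:", [vocab[i] for i in top])
```

The only change from vanilla collapsed Gibbs is the `boost_mat[:, wid]` factor, which nudges the seed words toward their predefined topics.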
I didn't read the whole article, but I only see this being useful if you have just a few keywords per predefined topic. I think it would be better to use something like word2vec, or just cosine similarity between word vectors, for this task instead; LDA wasn't really designed for situations where we already have predefined topics.
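For instance, something along these lines, where the embeddings would come from a pretrained word2vec/GloVe model (I use random vectors below just so the snippet runs):

```python
# Sketch: score candidate words against a tag's keywords by cosine similarity.
import numpy as np

rng = np.random.default_rng(1)
words = ["linguistics", "machine", "learning", "nlp", "gibbs", "banana"]
emb = {w: rng.normal(size=50) for w in words}  # stand-in for real embeddings

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

seed_words = ["linguistics", "machine", "learning"]
for w in ["nlp", "gibbs", "banana"]:
    score = max(cosine(emb[w], emb[s]) for s in seed_words)
    print(w, round(score, 3))
```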