
I've performed Latent Dirichlet Allocation (LDA) on a training set of documents.

At the ideal number of topics I would expect the perplexity on the test dataset to reach a minimum. However, I find that the perplexity for my test dataset increases with the number of topics.

[Plot: test-set perplexity increasing with the number of topics]

I'm using scikit-learn to do the LDA. The code I'm using to generate the plot is:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hold out 20% of the documents for evaluating perplexity
train, test = train_test_split(list(df['document'].values), test_size=0.2)

# Build the vocabulary on the training set only, then reuse it for the test set
vectoriser = CountVectorizer(stop_words='english', max_features=1500)
doc_train = vectoriser.fit_transform(train)
features = vectoriser.get_feature_names()
doc_test = vectoriser.transform(test)  # transform, not fit_transform, so train and test share a vocabulary

perplexity = []
alpha = 0.1
beta = 0.1

for topics in range(1, 21, 2):
    # Fit LDA to the training data and record test-set perplexity
    lda = LatentDirichletAllocation(n_components=topics, doc_topic_prior=alpha, topic_word_prior=beta)
    news_lda = lda.fit(doc_train)
    perplexity.append(news_lda.perplexity(doc_test))
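
For completeness, here is a minimal sketch of the plotting step that produces the figure above (assuming matplotlib; topic_range simply mirrors the loop over topics, it is not part of the original code):

import matplotlib.pyplot as plt

# Plot test-set perplexity against the number of topics tried above
topic_range = list(range(1, 21, 2))
plt.plot(topic_range, perplexity, marker='o')
plt.xlabel('Number of topics')
plt.ylabel('Test-set perplexity')
plt.show()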

I can't work out how perplexity is calculated by LatentDirichletAllocation.perplexity(), or whether this function is what's causing the problem. Any ideas?

BHC
  • Have you tried different values of `alpha` and `beta`? Choosing hyper-parameters for topic modeling is tricky and can lead to very different results (a minimal sweep is sketched after these comments). See this [answer](https://stats.stackexchange.com/questions/349761/reasonable-hyperparameter-range-for-latent-dirichlet-allocation/351183#351183) for more information. – kedarps Aug 30 '18 at 14:55
  • That's a good point - the values of alpha and beta I am using are low because I want to prefer classifying to few topics, but possibly this is causing all the documents to be assigned to one topic, which would be simple enough to visualise. (I do still think there may be an underlying issue with using LDA.perplexity() though - where would it get document-topic allocations for the train set?) In the end I decided that, to aid topic interpretability, I wanted to use an informed, asymmetric Dirichlet prior instead - I highly recommend the package https://github.com/vi3k6i5/GuidedLDA for doing this. – BHC Aug 31 '18 at 09:27
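
Following the comment above, a minimal sketch of such a hyper-parameter sweep (reusing doc_train and doc_test from the question; the grids of alpha and beta values and the fixed n_components=10 are illustrative assumptions, not values from the original post):

from sklearn.decomposition import LatentDirichletAllocation

# Illustrative sweep over the doc-topic prior (alpha) and topic-word prior (beta)
# at a fixed number of topics; the value grids below are assumptions, not from the post.
results = {}
for alpha in [0.01, 0.1, 0.5, 1.0]:
    for beta in [0.01, 0.1, 0.5, 1.0]:
        lda = LatentDirichletAllocation(n_components=10, doc_topic_prior=alpha, topic_word_prior=beta)
        lda.fit(doc_train)
        results[(alpha, beta)] = lda.perplexity(doc_test)

# (alpha, beta) pair with the lowest test perplexity across the grid
best_alpha, best_beta = min(results, key=results.get)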

1 Answer


From this Stack Overflow answer, it appears there is a bug in how perplexity is calculated in LatentDirichletAllocation.perplexity().
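
If the built-in method is suspect, one workaround is to compute perplexity by hand from the approximate log-likelihood returned by score(), using the standard exp(-log-likelihood per word) definition. A minimal sketch, reusing news_lda and doc_test from the question:

import numpy as np

# Derive perplexity from score() instead of calling perplexity() directly
log_likelihood = news_lda.score(doc_test)   # approximate (variational-bound) log-likelihood of the test set
total_words = doc_test.sum()                # total token count across the test documents
manual_perplexity = np.exp(-log_likelihood / total_words)

Comparing this value against news_lda.perplexity(doc_test) for a couple of topic counts would show whether the built-in method is the source of the increasing curve.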