I've performed Latent Dirichlet Allocation (LDA) on a training set of documents.
At the ideal number of topics I would expect perplexity on the test dataset to reach a minimum. However, I find that the perplexity for my test dataset increases with the number of topics.
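(By perplexity I mean the usual definition, the exponentiated negative average log-likelihood per token, so lower is better: perplexity = exp(-log p(test documents) / N), where N is the total number of tokens in the test set.)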
I'm using sklearn to do LDA. The code I'm using to generate the plot is:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train, test = train_test_split(list(df['document'].values), test_size=0.2)
vectoriser = CountVectorizer(stop_words='english', max_features=1500)
doc_train = vectoriser.fit_transform(train)
features = vectoriser.get_feature_names_out()  # get_feature_names() on older sklearn
# transform, not fit_transform: the test matrix must share the training
# vocabulary, otherwise the columns of doc_test mean different words
doc_test = vectoriser.transform(test)
perplexity = []
alpha = 0.1
beta = 0.1
for topics in range(1, 21, 2):
    # Fit LDA with the current number of topics, then score the held-out set
    lda = LatentDirichletAllocation(n_components=topics,
                                    doc_topic_prior=alpha,
                                    topic_word_prior=beta)
    news_lda = lda.fit(doc_train)
    perplexity.append(news_lda.perplexity(doc_test))
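The plotting step itself isn't shown above; it's just a standard matplotlib call along these lines (a minimal sketch, the exact styling isn't important):

import matplotlib.pyplot as plt

# Plot held-out perplexity against the number of topics tried in the loop
plt.plot(list(range(1, 21, 2)), perplexity, marker='o')
plt.xlabel('Number of topics')
plt.ylabel('Test-set perplexity')
plt.show()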
I can't work out how perplexity is calculated by LatentDirichletAllocation.perplexity() and whether it is this function that is causing problems. Any ideas?
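From the sklearn docs, perplexity is defined as exp(-1 * log-likelihood per word) and score() returns the approximate log-likelihood bound, so the two should be related as below (a hedged cross-check based on the documented definitions, not on reading the source):

import numpy as np

# If perplexity(X) == exp(-score(X) / total token count), this manual value
# should roughly match the library's number; a large discrepancy would point
# at the perplexity function (or the sklearn version) itself.
n_tokens = doc_test.sum()
manual_perplexity = np.exp(-news_lda.score(doc_test) / n_tokens)
print(manual_perplexity, news_lda.perplexity(doc_test))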