I am running LDA from Mark Steyvers's MATLAB Topic Modeling Toolbox on a few Apache Java open-source projects. I have taken care of stop-word removal (e.g., words such as "Apache" and Java keywords are marked as stop words) and tokenization. I find that perplexity on the test data always decreases as the number of topics increases. I tried different values of ALPHA, but it made no difference.
I need to find the optimal number of topics, and for that the perplexity plot should reach a minimum. Please suggest what may be wrong.
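For context, the sweep I am running looks roughly like this (a minimal sketch: the GibbsSamplerLDA call follows the toolbox's example scripts, while WStrain/DStrain/WStest/DStest, a held-out split in the toolbox's word-stream format, and the ldaPerplexity helper are hypothetical names of my own):

```
% Minimal sketch of the topic-number sweep.
Ts   = 10:10:200;            % candidate topic counts
BETA = 0.01;                 % a commonly used value
perp = zeros(size(Ts));
for i = 1:numel(Ts)
    T     = Ts(i);
    ALPHA = 50 / T;          % commonly used heuristic for alpha
    % Train on the training stream (N = 500 Gibbs iterations, SEED = 1)
    [WP, DP, Z] = GibbsSamplerLDA(WStrain, DStrain, T, 500, ALPHA, BETA, 1, 0);
    % Hypothetical helper: evaluates held-out perplexity from the
    % word-topic counts WP on the test stream
    perp(i) = ldaPerplexity(WP, WStest, DStest, ALPHA, BETA);
end
plot(Ts, perp); xlabel('number of topics T'); ylabel('test-set perplexity');
```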
The definition and details regarding the calculation of perplexity of a topic model are explained in this post.
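For reference, the standard definition (as in Blei et al., 2003) is

$$\operatorname{perplexity}(D_{\text{test}}) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right),$$

where $N_d$ is the number of tokens in test document $d$; lower is better.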
Edit: I played with the hyperparameters alpha and beta, and now the perplexity seems to reach a minimum. It is not clear to me how these hyperparameters affect perplexity. Initially I was plotting results up to 200 topics without any success; after modifying the hyperparameters, the minimum is reached on the same range at around 50-60 topics (which matched my intuition). Also, as this post notes, specific values of the hyperparameters bias the optimal number of topics.
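The kind of grid I tried can be sketched as follows (same assumptions as above; ldaPerplexity is the same hypothetical helper):

```
% Small grid over ALPHA and BETA at a fixed number of topics.
T = 50;
alphas = [0.1 0.5 50/T 2];    % 50/T is the commonly used heuristic
betas  = [0.001 0.01 0.1];
for a = alphas
    for b = betas
        [WP, DP, Z] = GibbsSamplerLDA(WStrain, DStrain, T, 500, a, b, 1, 0);
        fprintf('alpha=%.3f beta=%.3f perplexity=%.1f\n', ...
                a, b, ldaPerplexity(WP, WStest, DStest, a, b));
    end
end
```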