
I ran a 5-fold cross-validation in R to calculate LDA perplexity for k = 2:9 using a 10% sample of my data. The output was:

     2      3      4      5      6      7      8      9 
156277 139378  71659  68998  67471  32890  32711  31904
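
For context, each run follows a loop along these lines (a simplified sketch, not my exact script; dtm stands in for my DocumentTermMatrix, and the fold assignment shown is illustrative since my actual folds were created once and reused):

    library(topicmodels)

    # Illustrative: assign each document to one of 5 folds.
    # In my actual runs the folds were created once and reused across runs.
    folds <- sample(rep(1:5, length.out = nrow(dtm)))

    # Mean held-out perplexity across the 5 folds for a given k.
    cv_perplexity <- function(k) {
      mean(sapply(1:5, function(i) {
        fit <- LDA(dtm[folds != i, ], k = k)          # fit on the 4 training folds
        perplexity(fit, newdata = dtm[folds == i, ])  # score the held-out fold
      }))
    }

    ks <- 2:9  # later runs only change this vector
    setNames(sapply(ks, cv_perplexity), ks)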

I re-ran the CV with the full data set, using the same values of k, and obtained a similar output:

     2      3      4      5      6      7      8      9 
182572 134480  76722  73285  71907  35052  34238  33438 

My problem: I then tried a wider range of k on the full data set, reusing the same folds (i.e., the composition of the 5 folds remained exactly the same), but the output doesn't seem comparable. For k = c(2:5, 10, 15, 20, 30, 40, 50, 75, 100):

     2      3      4      5     10     15     20     30     40     50     75    100 
243384 180662 151901 110627  99078  93311  59114  56176  54383  26711  26162  25723 

Why can I compare perplexity between my first two outputs while the third output doesn't appear to be comparable? For example, k = 9 hovers around a perplexity of 32,000 in the first two outputs, while in the third output k = 10 comes in near 100,000, nowhere close. Wouldn't we expect the perplexity for k = 10 to remain near 32,000, since the topicmodels::LDA and topicmodels::perplexity calculations don't change when the folds and k stay the same?

Is perplexity supposed to change like this? I wasn't able to set a common seed across the three runs, but I wasn't expecting such a drastic difference in the third one.
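
If it matters for the answer: my understanding of the topicmodels API is that the seed can be fixed through LDA's control argument, e.g. (not something I did in the runs above):

    # Hypothetical: fix the initialization seed so that repeated fits
    # on the same fold with the same k give identical results.
    fit <- LDA(dtm[folds != 1, ], k = 10, control = list(seed = 1234))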
