I am trying to use Vowpal Wabbit to run Latent Dirichlet Allocation (LDA) on a corpus, and I am running into a few issues with the output.
To test it, I used a file with just 3 lines (3 documents, per the VW input format):
| now let fit a topic model on this dataset
| now let a good model on this dataset
| this is a document about sports
I ran VW LDA as follows:
vw --lda 2 --lda_D 3 --readable_model lda.model.txt -k --passes 10 \
   --cache_file doc_tokens.cache -d 1.txt -p prediction.dat --lda_rho 0.1
The command runs fine and generates two output files,
prediction.dat
and
lda.model.txt
. My questions about them are:
Apart from the first column, both output files contain sequences of floating-point numbers, like:
262130 0.100008 0.100009 262131 0.100013 0.100021 262132 0.100008 0.100010 262133 0.100018 0.100008 262134 0.100005 0.100008 262135 0.100010 0.100007 262136 0.100008 0.100026 262137 0.100005 0.100012 262138 0.100008 0.100014 262139 0.100005 0.100018 262140 0.100006 0.100007 262141 0.100006 0.100006 262142 0.100019 0.100023 262143 0.100019 0.100007
I thought passing
--readable_model
would produce the actual word strings representing the topics. Am I doing something wrong?
No matter how many documents I give, the output file (
lda.model.txt
) always has
262143
rows. Why is it doing that?