I am trying to use Vowpal Wabbit to run Latent Dirichlet Allocation (LDA) on a corpus, and I am running into a few issues with the output.
To test it, I used a file with just 3 lines (3 documents, per the VW input format):
| now let fit a topic model on this dataset
| now let a good model on this dataset
| this is a document about sports
I ran VW LDA as follows:
vw --lda 2 --lda_D 3 --readable_model lda.model.txt -k --passes 10 \
   --cache_file doc_tokens.cache -d 1.txt -p prediction.dat --lda_rho 0.1
The command runs fine and generates two output files,
prediction.dat
and
lda.model.txt
. My questions about them are:
Apart from the first column, both output files contain sequences of floating-point numbers, like:
262130 0.100008 0.100009 262131 0.100013 0.100021 262132 0.100008 0.100010 262133 0.100018 0.100008 262134 0.100005 0.100008 262135 0.100010 0.100007 262136 0.100008 0.100026 262137 0.100005 0.100012 262138 0.100008 0.100014 262139 0.100005 0.100018 262140 0.100006 0.100007 262141 0.100006 0.100006 262142 0.100019 0.100023 262143 0.100019 0.100007
I thought passing
--readable_model
would produce the actual word strings representing the topics. Am I doing something wrong?
No matter how many documents I give, the output file (
lda.model.txt
) always has
262143
rows. Why is it doing that?