I found an interesting question about topic modelling for small corpora: "Topic models for short documents".
This got me wondering: if I am doing topic modelling on small amounts of text using BiTerm, should I really be stemming as part of the data curation phase?
The reason I ask is that I am currently stemming and finding that very few words share a root, and I speculate that if I didn't stem, I'd get more "readable" words in the output.
In other words, I can see the argument for stemming large text corpora, but I'm wondering if there are diminishing returns for texts of 100 words or fewer. Note that I am removing stop words prior to analysis.
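To make the "very few overlapping roots" observation concrete, here is a minimal sketch of how it could be measured. It is not my actual pipeline: `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter, `STOP_WORDS` is a tiny illustrative list, and the sample sentence is made up.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not a real curated list).
STOP_WORDS = {"the", "a", "to", "of", "and", "is", "it", "in", "for", "on"}


def crude_stem(word):
    """Strip a few common English suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def content_tokens(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def stem_collisions(text, stem=crude_stem):
    """Return {stem: surface forms} for stems shared by 2+ distinct forms."""
    groups = defaultdict(set)
    for tok in content_tokens(text):
        groups[stem(tok)].add(tok)
    return {s: forms for s, forms in groups.items() if len(forms) > 1}


def vocab_reduction(text, stem=crude_stem):
    """Return (distinct surface forms, distinct stems) for the text."""
    vocab = set(content_tokens(text))
    return len(vocab), len({stem(t) for t in vocab})


if __name__ == "__main__":
    sample = "merge the fixes and test the fix; tested uploads, upload it"
    print(stem_collisions(sample))
    print(vocab_reduction(sample))
```

If the two numbers from `vocab_reduction` are nearly equal for a given thread, stemming is merging almost nothing, and its main effect is on how readable the top words are.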
For example, the analysis of an approximately 100-word instant message thread looks like this when stemmed:
p(z) Top words
0.134190 doesn:0.112132 t:0.112132 take:0.112132 like:0.112132 5:0.112132 minut:0.112132 build:0.112132 script:0.020221 simpl:0.020221 ahoy:0.001838
0.118805 said:0.105372 someth:0.105372 12:0.105372 16:0.105372 hour:0.105372 list:0.105372 small:0.043388 thing:0.043388 test:0.043388 nvidia:0.022727
0.113676 figur:0.109914 probabl:0.109914 merg:0.109914 without:0.109914 much:0.109914 troubl:0.109914 approx:0.045259 flrdkjdjd:0.023707 driver:0.023707 40:0.023707
0.103420 now:0.120283 let:0.120283 s:0.120283 keep:0.120283 separ:0.120283 hackish:0.120283 chang:0.025943 name:0.025943 ahoy:0.002358 hey:0.002358
0.093163 well:0.106771 i386:0.106771 binari:0.106771 avail:0.106771 don:0.106771 need:0.054688 fix:0.054688 today:0.054688 40:0.028646 around:0.028646
0.088034 will:0.112637 work:0.112637 integr:0.112637 solut:0.112637 hoari:0.112637 ati:0.057692 flrdkjdjd:0.030220 driver:0.030220 happi:0.030220 directli:0.030220
0.077777 issu:0.095679 remain:0.095679 next:0.095679 upload:0.095679 skip:0.064815 kernel:0.064815 part:0.064815 mani:0.064815 interest:0.033951 tester:0.033951
0.067521 also:0.109155 ubuntu:0.109155 user:0.109155 help:0.109155 alreadi:0.038732 yesterday:0.038732 kbd:0.038732 debian:0.038732 want:0.038732 can:0.038732
0.057264 plan:0.127049 stay:0.127049 late:0.127049 tonight:0.127049 whatev:0.045082 easiest:0.045082 ahoy:0.004098 hey:0.004098 morn:0.004098 ati:0.004098
0.052136 provid:0.138393 x:0.138393 happi:0.093750 directli:0.093750 2:0.049107 major:0.049107 ahoy:0.004464 hey:0.004464 morn:0.004464 ati:0.004464
0.052136 feel:0.138393 just:0.138393 go:0.138393 ahead:0.138393 ahoy:0.004464 hey:0.004464 morn:0.004464 ati:0.004464 flrdkjdjd:0.004464 driver:0.004464
0.041879 minor:0.114130 first:0.114130 make:0.114130 interest:0.059783 tester:0.059783 ahoy:0.005435 hey:0.005435 morn:0.005435 ati:0.005435 flrdkjdjd:0.005435
And this when not stemmed:
p(z) Top words
0.138056 x:0.105655 doesn:0.105655 t:0.105655 take:0.105655 like:0.105655 5:0.105655 minutes:0.105655 build:0.105655 ahoy:0.001488 hey:0.001488
0.120888 one:0.103041 figured:0.103041 probably:0.103041 merged:0.103041 without:0.103041 much:0.103041 trouble:0.103041 bunch:0.035473 bug:0.035473 fixes:0.035473
0.099428 s:0.123984 now:0.103659 let:0.103659 keep:0.103659 separate:0.103659 hackish:0.103659 skip:0.042683 kernel:0.042683 part:0.042683 detail:0.022358
0.099428 can:0.184959 well:0.083333 i386:0.083333 binaries:0.083333 available:0.083333 much:0.063008 want:0.063008 test:0.063008 t:0.042683 don:0.042683
0.095136 one:0.129237 will:0.108051 work:0.108051 integrated:0.108051 solution:0.108051 hoary:0.108051 nvidia:0.023305 blind:0.023305 uploads:0.023305 major:0.023305
0.090844 feel:0.090708 merge:0.090708 just:0.090708 go:0.090708 ahead:0.090708 also:0.068584 ubuntu:0.068584 users:0.068584 help:0.068584 whatever:0.024336
0.086552 said:0.118056 something:0.118056 12:0.118056 16:0.118056 hours:0.118056 list:0.118056 kbd:0.025463 debian:0.025463 ahoy:0.002315 hey:0.002315
0.073676 x:0.163978 happy:0.083333 provide:0.083333 directly:0.083333 need:0.083333 fix:0.083333 today:0.083333 change:0.029570 names:0.029570 ahoy:0.002688
0.056508 issues:0.106164 remain:0.106164 next:0.106164 upload:0.106164 script:0.037671 simple:0.037671 2:0.037671 major:0.037671 approx:0.037671 40:0.037671
0.052216 ati:0.077206 flrdkjdjds:0.077206 driver:0.077206 minor:0.077206 first:0.077206 make:0.077206 already:0.040441 yesterday:0.040441 2:0.040441 changes:0.040441
0.043632 small:0.090517 things:0.090517 tested:0.090517 many:0.090517 interested:0.090517 testers:0.090517 ahoy:0.004310 hey:0.004310 morning:0.004310 ati:0.004310
0.043632 planning:0.133621 stay:0.133621 late:0.133621 tonight:0.133621 ahoy:0.004310 hey:0.004310 morning:0.004310 ati:0.004310 flrdkjdjds:0.004310 driver:0.004310
Is there any literature on whether stemming helps or hurts when working with small amounts of text?