
I found an interesting question around topic modelling for small corpora: Topic models for short documents

This got me wondering: if I am doing topic modelling on small amounts of text using BiTerm, should I really be stemming as part of the data curation phase?

The reason I ask is that I am currently stemming and finding very few words with overlapping roots, and I speculate that if I didn't stem, the output would contain more "readable" words.

In other words, I can see the argument for stemming large text corpora, but I'm wondering if there are diminishing returns for texts of 100 words or fewer. Note that I am removing stop words prior to analysis.
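One way to quantify how much stemming actually changes a short text is to count how many distinct tokens collapse onto a shared stem. The following is a minimal sketch, not the pipeline used for the results in this post: `toy_stem` is a deliberately crude, hypothetical suffix stripper standing in for a real stemmer, and the token list is made up for illustration. If NLTK is installed, `PorterStemmer().stem` could be passed as the `stem` argument instead.

```python
from collections import defaultdict

# Crude, illustrative suffix list (an assumption, not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_collisions(tokens, stem=toy_stem):
    """Return {stem: sorted surface forms} for stems shared by 2+ distinct tokens."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[stem(tok)].add(tok)
    return {s: sorted(forms) for s, forms in groups.items() if len(forms) > 1}


# Made-up tokens, assumed to be lower-cased with stop words already removed.
tokens = ["fix", "fixes", "test", "tested", "testers",
          "driver", "build", "minutes", "upload", "uploads"]

vocab_before = len(set(tokens))                 # distinct surface forms
vocab_after = len({toy_stem(t) for t in tokens})  # distinct stems
collisions = stem_collisions(tokens)

print(vocab_before, vocab_after)
print(collisions)
```

If `vocab_before` and `vocab_after` are nearly equal, stemming is merging very few terms, so it mostly changes how the output reads rather than the counts the model sees.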

For example, the analysis of an instant message thread of approximately 100 words could look like this when stemmed:

p(z)        Top words
0.134190    doesn:0.112132 t:0.112132 take:0.112132 like:0.112132 5:0.112132 minut:0.112132 build:0.112132 script:0.020221 simpl:0.020221 ahoy:0.001838
0.118805    said:0.105372 someth:0.105372 12:0.105372 16:0.105372 hour:0.105372 list:0.105372 small:0.043388 thing:0.043388 test:0.043388 nvidia:0.022727
0.113676    figur:0.109914 probabl:0.109914 merg:0.109914 without:0.109914 much:0.109914 troubl:0.109914 approx:0.045259 flrdkjdjd:0.023707 driver:0.023707 40:0.023707
0.103420    now:0.120283 let:0.120283 s:0.120283 keep:0.120283 separ:0.120283 hackish:0.120283 chang:0.025943 name:0.025943 ahoy:0.002358 hey:0.002358
0.093163    well:0.106771 i386:0.106771 binari:0.106771 avail:0.106771 don:0.106771 need:0.054688 fix:0.054688 today:0.054688 40:0.028646 around:0.028646
0.088034    will:0.112637 work:0.112637 integr:0.112637 solut:0.112637 hoari:0.112637 ati:0.057692 flrdkjdjd:0.030220 driver:0.030220 happi:0.030220 directli:0.030220
0.077777    issu:0.095679 remain:0.095679 next:0.095679 upload:0.095679 skip:0.064815 kernel:0.064815 part:0.064815 mani:0.064815 interest:0.033951 tester:0.033951
0.067521    also:0.109155 ubuntu:0.109155 user:0.109155 help:0.109155 alreadi:0.038732 yesterday:0.038732 kbd:0.038732 debian:0.038732 want:0.038732 can:0.038732
0.057264    plan:0.127049 stay:0.127049 late:0.127049 tonight:0.127049 whatev:0.045082 easiest:0.045082 ahoy:0.004098 hey:0.004098 morn:0.004098 ati:0.004098
0.052136    provid:0.138393 x:0.138393 happi:0.093750 directli:0.093750 2:0.049107 major:0.049107 ahoy:0.004464 hey:0.004464 morn:0.004464 ati:0.004464
0.052136    feel:0.138393 just:0.138393 go:0.138393 ahead:0.138393 ahoy:0.004464 hey:0.004464 morn:0.004464 ati:0.004464 flrdkjdjd:0.004464 driver:0.004464
0.041879    minor:0.114130 first:0.114130 make:0.114130 interest:0.059783 tester:0.059783 ahoy:0.005435 hey:0.005435 morn:0.005435 ati:0.005435 flrdkjdjd:0.005435

And this when not stemmed:

p(z)        Top words
0.138056    x:0.105655 doesn:0.105655 t:0.105655 take:0.105655 like:0.105655 5:0.105655 minutes:0.105655 build:0.105655 ahoy:0.001488 hey:0.001488
0.120888    one:0.103041 figured:0.103041 probably:0.103041 merged:0.103041 without:0.103041 much:0.103041 trouble:0.103041 bunch:0.035473 bug:0.035473 fixes:0.035473
0.099428    s:0.123984 now:0.103659 let:0.103659 keep:0.103659 separate:0.103659 hackish:0.103659 skip:0.042683 kernel:0.042683 part:0.042683 detail:0.022358
0.099428    can:0.184959 well:0.083333 i386:0.083333 binaries:0.083333 available:0.083333 much:0.063008 want:0.063008 test:0.063008 t:0.042683 don:0.042683
0.095136    one:0.129237 will:0.108051 work:0.108051 integrated:0.108051 solution:0.108051 hoary:0.108051 nvidia:0.023305 blind:0.023305 uploads:0.023305 major:0.023305
0.090844    feel:0.090708 merge:0.090708 just:0.090708 go:0.090708 ahead:0.090708 also:0.068584 ubuntu:0.068584 users:0.068584 help:0.068584 whatever:0.024336
0.086552    said:0.118056 something:0.118056 12:0.118056 16:0.118056 hours:0.118056 list:0.118056 kbd:0.025463 debian:0.025463 ahoy:0.002315 hey:0.002315
0.073676    x:0.163978 happy:0.083333 provide:0.083333 directly:0.083333 need:0.083333 fix:0.083333 today:0.083333 change:0.029570 names:0.029570 ahoy:0.002688
0.056508    issues:0.106164 remain:0.106164 next:0.106164 upload:0.106164 script:0.037671 simple:0.037671 2:0.037671 major:0.037671 approx:0.037671 40:0.037671
0.052216    ati:0.077206 flrdkjdjds:0.077206 driver:0.077206 minor:0.077206 first:0.077206 make:0.077206 already:0.040441 yesterday:0.040441 2:0.040441 changes:0.040441
0.043632    small:0.090517 things:0.090517 tested:0.090517 many:0.090517 interested:0.090517 testers:0.090517 ahoy:0.004310 hey:0.004310 morning:0.004310 ati:0.004310
0.043632    planning:0.133621 stay:0.133621 late:0.133621 tonight:0.133621 ahoy:0.004310 hey:0.004310 morning:0.004310 ati:0.004310 flrdkjdjds:0.004310 driver:0.004310

Is there literature on the efficacy of stemming for small amounts of text, either for or against?

kjetil b halvorsen
Jonathan Dunne
  • I would recommend stemming, especially if you are dealing with a small corpus, because given its size the topics may be quite sensitive to minor changes in term frequencies. As far as readability is concerned, there are ways to complete or reconstruct your stems, even though they aren't error-proof. Have a look at the stemCompletion function in the R tm package – de1pher Oct 12 '17 at 19:09
  • I haven't looked at stem completion in R but will check it out. In terms of stemming, I did a comparison of the word count across three separate threads: out of approx. 3 * 100 word threads, only 2 words were stemmed. In other words, the number of words dropped from 300 to 298 after stemming. This leads me to believe that, based on the corpora analysed so far, there are not many stemmable words, and therefore stemming would not be justified. Thoughts? – Jonathan Dunne Oct 12 '17 at 19:59
  • If that's the case, then I think you may find that your model is not going to perform well either way – de1pher Oct 12 '17 at 21:00
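The stem completion idea mentioned in the comments can be sketched in Python as follows. This is only an illustration of the general approach (map each stem back to its most frequent original form), not a port of R's stemCompletion; `toy_stem`, `build_completion_map` and the token list are hypothetical names and data introduced for the example.

```python
from collections import Counter, defaultdict

# Crude, illustrative suffix list (an assumption, not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_completion_map(tokens, stem=toy_stem):
    """Map each stem to its most frequent surface form in the corpus.

    Ties are broken by shorter form first, then alphabetically.
    """
    counts = defaultdict(Counter)
    for tok in tokens:
        counts[stem(tok)][tok] += 1
    return {
        s: min(forms, key=lambda w: (-forms[w], len(w), w))
        for s, forms in counts.items()
    }


# Made-up tokens, assumed to be lower-cased with stop words already removed.
tokens = ["minutes", "minutes", "minute", "issues", "issue", "issues", "driver"]

completion = build_completion_map(tokens)

# Relabel stems with readable words, e.g. when printing a topic's top words.
readable = [completion[toy_stem(t)] for t in ["minute", "issue"]]

print(completion)
print(readable)
```

The model would still be fitted on the stems; the map is only applied when displaying results, so the counts are unaffected by the relabelling.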

0 Answers