NLTK BigramAssocMeasures.pmi gives the same score for all bigrams

Question

I am trying to use the PMI measure from BigramAssocMeasures to find the most important bigrams. However, it gives every bigram the same score, so .nbest returns a list in alphabetical order. When I use bigram_measures.likelihood_ratio instead, the results seem correct.

The text I am analyzing is lots of questions combined together to form a 6 million word corpus. Could someone please tell me what I am doing wrong?

   import nltk
   from nltk import collocations

   bigram_measures = collocations.BigramAssocMeasures()

   # full_text (a list of tokens) and stopwords are defined earlier
   finder = nltk.BigramCollocationFinder.from_words(full_text)
   finder.apply_word_filter(lambda x: x in stopwords)
   scored = finder.score_ngrams(bigram_measures.pmi)
   for bscore in scored[:30]:
       print(bscore)

Output

(('\x02tñ\x7f¼é\x1aaùõ\x8d¶rwìiìñó', '\x10œø'), 22.60745494022481)
(('\x10œø', '\x17'), 22.60745494022481)
(('\x17', 'y.¾ƒe'), 22.60745494022481)
(("'07", "'08"), 22.60745494022481)
(("'20s", "'30s"), 22.60745494022481)
(("'24-jan-2018", "'24/01/2018"), 22.60745494022481)
(("'42", 'salko'), 22.60745494022481)
(("'acclaimed", 'musician/'), 22.60745494022481)
(("'adiye", 'manam'), 22.60745494022481)
(("'afflict", "'inflict"), 22.60745494022481)
(("'allegretto", 'tranquillo'), 22.60745494022481)
(("'amar", 'maruf'), 22.60745494022481)
(("'anekantwad", "'syadvada"), 22.60745494022481)
(("'anger", "'anticipation"), 22.60745494022481)
(("'annum", "'year"), 22.60745494022481)
(("'anti-fracking", 'anti-pipeline'), 22.60745494022481)
(("'anyway", "'anyways"), 22.60745494022481)
(("'apoapsis", "'periapsis"), 22.60745494022481)
(("'association", "'sponsors"), 22.60745494022481)
(("'audacious", "'audacity"), 22.60745494022481)
(("'babu", "'shona"), 22.60745494022481)
(("'baklava", "'balaclava"), 22.60745494022481)
(("'baniya", "'ambani"), 22.60745494022481)
(("'bet", "'cast"), 22.60745494022481)
(("'bhakt", "'chamcha"), 22.60745494022481)
(("'bheege", 'honth'), 22.60745494022481)
(("'blinded", 'beleif'), 22.60745494022481)
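
One possible reading of this output, offered as a sketch and not a confirmed diagnosis: PMI for a bigram is log2(bigram_count * N / (count_of_first_word * count_of_second_word)), where N is the number of words in the corpus. If a bigram occurs once and both of its words also occur only once, this collapses to log2(N), which is the largest value PMI can take and is identical for every such pair. The helper `pmi` and the counts for the "common pair" below are hypothetical and only illustrate the arithmetic; the check recovers N from the repeated score to see whether it is in the range of the 6 million words mentioned in the question.

```python
import math

observed_score = 22.60745494022481  # the value repeated in the output above

# If bigram count = 1 and both word counts = 1, then PMI = log2(N),
# so the corpus size implied by the score is 2 ** score.
implied_corpus_size = 2 ** observed_score
print(round(implied_corpus_size))


def pmi(bigram_count, first_count, second_count, total_words):
    """PMI as log2 of observed over expected co-occurrence (illustrative helper)."""
    return math.log2(bigram_count * total_words / (first_count * second_count))


n = round(implied_corpus_size)
rare_pair = pmi(1, 1, 1, n)             # both words occur exactly once
common_pair = pmi(500, 2000, 3000, n)   # hypothetical counts for a frequent pair
print(rare_pair, common_pair)
```

If this reading is right, the top of the ranking is filled with one-off token pairs that all tie at log2(N), and tied scores are then listed in alphabetical order.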

You may want to apply your tokenization and stopword filtering earlier rather than after you have called the finder — Leslie B., Jun 29 '19 at 00:32
@LeslieB. even after your advice I get the same PMI for all the bigrams. What else can we do? — Mr. Unnormalized Posterior, Sep 27 '19 at 09:16
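
Something that may be worth trying, assuming the identical scores come from bigrams that occur only once: drop rare bigrams before scoring. In NLTK the finder has an apply_freq_filter(min_freq) method for this, called before score_ngrams. The sketch below does not use NLTK; `pmi_scores` is a hypothetical pure-Python stand-in on a made-up toy corpus, meant only to show how a minimum-frequency cutoff changes a PMI ranking.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_freq=1):
    """Score adjacent word pairs by PMI, skipping pairs seen fewer than min_freq times."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_freq:
            continue  # analogous to finder.apply_freq_filter(min_freq) in NLTK
        scores[(w1, w2)] = math.log2(count * total / (word_counts[w1] * word_counts[w2]))
    # Highest score first; ties fall back to alphabetical order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up toy corpus: one recurring pair plus a few one-off junk tokens.
tokens = ("new york " * 5 + "the cat the dog the new the york " + "zq xv ab cd").split()

unfiltered = pmi_scores(tokens)
filtered = pmi_scores(tokens, min_freq=3)

print(unfiltered[:3])  # one-off pairs tie at log2(len(tokens))
print(filtered)        # only pairs seen at least 3 times remain
```

The cutoff of 3 is arbitrary here; on a corpus of several million words a higher threshold may be more sensible, and the right value depends on the data.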
