TL;DR
Since you are interested in the entropy (or perplexity) of a sentence, I would definitely prefer KneserNeyProbDist, since it is specifically designed for N-gram smoothing.
Their differences
All the probability models you mentioned estimate a probability distribution from a sample of data, represented by a counter (or histogram) class called FreqDist. Their key difference is how they do smoothing, i.e. how they account for unseen data. This is important in NLP because many distributions follow Zipf's law, so out-of-vocabulary words / n-grams appear constantly.
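For illustration, this is roughly how a FreqDist is built and what it records (I use a plain whitespace split here instead of a proper tokenizer, so the counts are only illustrative):

from nltk import FreqDist

# count unigrams with a simple whitespace split (a real tokenizer may differ)
tokens = "I like python but I do n't like java .".split()
fd = FreqDist(tokens)

print(fd['like'])   # raw count of 'like'
print(fd.N())       # total number of tokens
print(fd.B())       # number of distinct types observed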
You can easily find how each of these methods does smoothing from its documentation. For easy reference, see below. The first group of distributions is similar (a short sketch follows this list):
MLEProbDist: basically no smoothing, just an MLE estimate, resulting in the empirical distribution. Unseen words are assigned probability 0.
ELEProbDist: add 0.5 to all counts (including those of possibly unseen words, whose original count is 0), then MLE.
LaplaceProbDist: add 1 to all counts, then MLE.
LidstoneProbDist: add gamma to all counts, then MLE. For instance, if you specify gamma=1, the result is the same as LaplaceProbDist.
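Here is a rough sketch of how this first group is constructed and queried (the toy sentence is my own illustrative choice):

from nltk import FreqDist
from nltk.probability import (MLEProbDist, ELEProbDist,
                              LaplaceProbDist, LidstoneProbDist)

fd = FreqDist("I like python but I do n't like java .".split())

mle = MLEProbDist(fd)
ele = ELEProbDist(fd)                 # equivalent to Lidstone with gamma=0.5
lap = LaplaceProbDist(fd)             # equivalent to Lidstone with gamma=1
lid = LidstoneProbDist(fd, gamma=1)   # same result as LaplaceProbDist

print(mle.prob('python'), mle.prob('c++'))   # the unseen token gets 0 under MLE
print(lap.prob('c++'), lid.prob('c++'))      # the two add-one estimates agree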
The different ones are (again, a short sketch follows this list):
WittenBellProbDist: you need to specify via bins how many word/data types are possible in total, even though not all of them appear in the corpus.
RandomProbDist: just random, nothing interesting.
HeldoutProbDist: you need two counters (FreqDists), a base one and a held-out one. I don't know exactly what it does.
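For illustration, something like this should work for those two (the bins value and the two toy sentences are just my own example choices):

from nltk import FreqDist
from nltk.probability import WittenBellProbDist, HeldoutProbDist

train_fd   = FreqDist("I like python but I do n't like java .".split())
heldout_fd = FreqDist("John like c but he does n't like java .".split())

# Witten-Bell needs the total number of possible types (bins),
# not just the ones actually observed in train_fd
wb = WittenBellProbDist(train_fd, bins=20)
print(wb.prob('python'), wb.prob('c++'))   # unseen tokens get non-zero mass

# held-out estimation takes a base counter plus a second, held-out counter
ho = HeldoutProbDist(train_fd, heldout_fd, bins=20)
print(ho.prob('like'))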
Example with unigrams
Consider the following sentence:
"I like python but I do n\'t like java . John like c but he does n\'t like java . "
Based on the sentence above, we use the different models to estimate probabilities for the following three tokens: I, python and c++. (Note that c++ does not appear in the training sentence above.)
UNIGRAM: I python c++
frequency: 2 1 0
empirical: 0.095 0.048 0.000
-----------------------------------------
MLE 0.095 0.048 0.000
ELE 0.091 0.055 0.018
Laplace 0.088 0.059 0.029
Lidstone(gamma=1) 0.088 0.059 0.029
Lidstone(gamma=2) 0.085 0.064 0.043
WittenBell(bins=20) 0.059 0.029 0.055
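The table above can be reproduced roughly like this (the exact numbers depend on how the sentence is tokenized, so the plain whitespace split used here may give slightly different values):

from nltk import FreqDist
from nltk.probability import (MLEProbDist, ELEProbDist, LaplaceProbDist,
                              LidstoneProbDist, WittenBellProbDist)

sent = "I like python but I do n't like java . John like c but he does n't like java ."
fd = FreqDist(sent.split())   # a real tokenizer may split differently

models = [
    ('MLE',                 MLEProbDist(fd)),
    ('ELE',                 ELEProbDist(fd)),
    ('Laplace',             LaplaceProbDist(fd)),
    ('Lidstone(gamma=1)',   LidstoneProbDist(fd, gamma=1)),
    ('Lidstone(gamma=2)',   LidstoneProbDist(fd, gamma=2)),
    ('WittenBell(bins=20)', WittenBellProbDist(fd, bins=20)),
]

# relative frequencies (empirical distribution), then each smoothed estimate
print('empirical            %.3f %.3f %.3f' % (fd.freq('I'), fd.freq('python'), fd.freq('c++')))
for name, pd in models:
    print('%-20s %.3f %.3f %.3f' % (name, pd.prob('I'), pd.prob('python'), pd.prob('c++')))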
The best one for your purpose
For the above ones, you can choose what to count in the FreqDist: either tokens (a.k.a. unigrams, as in the example above), bigrams, or trigrams. The exception is KneserNeyProbDist, which only accepts trigrams.
In my opinion, as long as you have a large corpus, KneserNeyProbDist should be the way to go. Modeling a sentence with trigrams is more meaningful than modeling it with unigrams alone, and KneserNeyProbDist is designed for trigrams and usually achieves much lower perplexity than unigram models.
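To tie this back to the entropy/perplexity question, here is a rough sketch of training a Kneser-Ney trigram model and scoring a sentence with it (the tiny training text and the test sentence are toy examples of my own; a real corpus would be far larger):

import math
from nltk import FreqDist, trigrams
from nltk.probability import KneserNeyProbDist

train = "I like python but I do n't like java . John like c but he does n't like java .".split()

# KneserNeyProbDist expects a FreqDist over trigrams (3-tuples)
kn = KneserNeyProbDist(FreqDist(trigrams(train)))

test = "he does n't like java .".split()
logprobs = [kn.logprob(tri) for tri in trigrams(test)]   # log base 2

# cross-entropy in bits per trigram, and the corresponding perplexity
entropy = -sum(logprobs) / len(logprobs)
print(entropy, 2 ** entropy)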