TL;DR
Since you are interested in the entropy (or perplexity) of a sentence, I would definitely prefer KneserNeyProbDist, since it is specifically designed for N-gram smoothing.
Their differences
All the probability models you mentioned estimate a probability distribution from a sample of data, represented by a counter (or histogram) class called FreqDist. Their key difference is how they do smoothing, i.e. how they account for unseen data. This is important in NLP because many distributions follow Zipf's law, so out-of-vocabulary words / n-grams appear constantly.
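For illustration, this is roughly how a FreqDist is built and what it records (I use a plain whitespace split here instead of a proper tokenizer, so the counts are only illustrative):

from nltk import FreqDist

# count unigrams with a simple whitespace split (a real tokenizer may differ)
tokens = "I like python but I do n't like java .".split()
fd = FreqDist(tokens)

print(fd['like'])   # raw count of 'like'
print(fd.N())       # total number of tokens
print(fd.B())       # number of distinct types observed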
You can easily find how each of these methods does smoothing from its documentation. For easy reference, see below. The first group of distributions is similar (a short sketch follows this list):
MLEProbDist: basically no smoothing, just an MLE estimate, resulting in the empirical distribution. Unseen words are assigned probability 0.
ELEProbDist: add 0.5 to all counts (including those of possibly unseen words, whose original count is 0), then MLE.
LaplaceProbDist: add 1 to all counts, then MLE.
LidstoneProbDist: add gamma to all counts, then MLE. For instance, if you specify gamma=1, the result is the same as LaplaceProbDist.
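Here is a rough sketch of how this first group is constructed and queried (the toy sentence is my own illustrative choice):

from nltk import FreqDist
from nltk.probability import (MLEProbDist, ELEProbDist,
                              LaplaceProbDist, LidstoneProbDist)

fd = FreqDist("I like python but I do n't like java .".split())

mle = MLEProbDist(fd)
ele = ELEProbDist(fd)                 # equivalent to Lidstone with gamma=0.5
lap = LaplaceProbDist(fd)             # equivalent to Lidstone with gamma=1
lid = LidstoneProbDist(fd, gamma=1)   # same result as LaplaceProbDist

print(mle.prob('python'), mle.prob('c++'))   # the unseen token gets 0 under MLE
print(lap.prob('c++'), lid.prob('c++'))      # the two add-one estimates agree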
The different ones are (again, a short sketch follows this list):
WittenBellProbDist: you need to specify via bins how many word/data types are possible in total, even though not all of them appear in the corpus.
RandomProbDist: just random, nothing interesting.
HeldoutProbDist: you need two counters (FreqDists), a base one and a held-out one. I don't know exactly what it does.
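For illustration, something like this should work for those two (the bins value and the two toy sentences are just my own example choices):

from nltk import FreqDist
from nltk.probability import WittenBellProbDist, HeldoutProbDist

train_fd   = FreqDist("I like python but I do n't like java .".split())
heldout_fd = FreqDist("John like c but he does n't like java .".split())

# Witten-Bell needs the total number of possible types (bins),
# not just the ones actually observed in train_fd
wb = WittenBellProbDist(train_fd, bins=20)
print(wb.prob('python'), wb.prob('c++'))   # unseen tokens get non-zero mass

# held-out estimation takes a base counter plus a second, held-out counter
ho = HeldoutProbDist(train_fd, heldout_fd, bins=20)
print(ho.prob('like'))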
Example with unigrams
Consider the following sentence:
"I like python but I do n\'t like java . John like c but he does n\'t like java . "
Based on the sentence above, we use the different models to estimate probabilities for the following three tokens: I, python and c++. (Note that c++ does not appear in the training sentence above.)
UNIGRAM: I python c++
frequency: 2 1 0
empirical: 0.095 0.048 0.000
-----------------------------------------
MLE 0.095 0.048 0.000
ELE 0.091 0.055 0.018
Laplace 0.088 0.059 0.029
Lidstone(gamma=1) 0.088 0.059 0.029
Lidstone(gamma=2) 0.085 0.064 0.043
WittenBell(bins=20) 0.059 0.029 0.055
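The table above can be reproduced roughly like this (the exact numbers depend on how the sentence is tokenized, so the plain whitespace split used here may give slightly different values):

from nltk import FreqDist
from nltk.probability import (MLEProbDist, ELEProbDist, LaplaceProbDist,
                              LidstoneProbDist, WittenBellProbDist)

sent = "I like python but I do n't like java . John like c but he does n't like java ."
fd = FreqDist(sent.split())   # a real tokenizer may split differently

models = [
    ('MLE',                 MLEProbDist(fd)),
    ('ELE',                 ELEProbDist(fd)),
    ('Laplace',             LaplaceProbDist(fd)),
    ('Lidstone(gamma=1)',   LidstoneProbDist(fd, gamma=1)),
    ('Lidstone(gamma=2)',   LidstoneProbDist(fd, gamma=2)),
    ('WittenBell(bins=20)', WittenBellProbDist(fd, bins=20)),
]

# relative frequencies (empirical distribution), then each smoothed estimate
print('empirical            %.3f %.3f %.3f' % (fd.freq('I'), fd.freq('python'), fd.freq('c++')))
for name, pd in models:
    print('%-20s %.3f %.3f %.3f' % (name, pd.prob('I'), pd.prob('python'), pd.prob('c++')))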
The best one for your purpose
For the above ones, you can choose what to count in the FreqDist: either tokens (a.k.a. unigrams, as in the example above), bigrams, or trigrams. The exception is KneserNeyProbDist, which only accepts trigrams.
In my opinion, as long as you have a large corpus, KneserNeyProbDist should be the way to go. Modeling a sentence with trigrams is more meaningful than modeling it with unigrams alone, and KneserNeyProbDist is designed for trigrams and usually achieves much lower perplexity than unigram models.
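To tie this back to the entropy/perplexity question, here is a rough sketch of training a Kneser-Ney trigram model and scoring a sentence with it (the tiny training text and the test sentence are toy examples of my own; a real corpus would be far larger):

import math
from nltk import FreqDist, trigrams
from nltk.probability import KneserNeyProbDist

train = "I like python but I do n't like java . John like c but he does n't like java .".split()

# KneserNeyProbDist expects a FreqDist over trigrams (3-tuples)
kn = KneserNeyProbDist(FreqDist(trigrams(train)))

test = "he does n't like java .".split()
logprobs = [kn.logprob(tri) for tri in trigrams(test)]   # log base 2

# cross-entropy in bits per trigram, and the corresponding perplexity
entropy = -sum(logprobs) / len(logprobs)
print(entropy, 2 ** entropy)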