I'm using the .show_most_informative_features() method of NLTK's Naive Bayes classifier to generate features to be used with a lexicon. For my binary-classification problem, the informativeness of each feature is calculated as (where $W$ = feature and $V$ = class): $$ \max\!\left(\frac{P(W \mid V_1)}{P(W \mid V_2)},\; \frac{P(W \mid V_2)}{P(W \mid V_1)}\right) $$
I just want to make sure that I understand how NLTK's Naive Bayes calculates the probability of a feature given a particular class. Essentially, I need someone to walk me through how this quantity is calculated:
$$ P(W \mid V_i) $$
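To make the ratio concrete, here is how I read the first row of the sample output further down; the 13.9 : 1.0 comes straight from that output, and I am treating it as the ratio of the two class-conditional probabilities:

$$ \frac{P(\text{outstanding} = 1 \mid \text{pos})}{P(\text{outstanding} = 1 \mid \text{neg})} \approx \frac{13.9}{1.0} $$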
I believe the type of Naive Bayes NLTK uses is multinomial. An explanation sourced from this Stack Overflow post:
The probability of a word given the tag is computed in the train() function using Expected Likelihood Estimation from ELEProbDist, which is a LidstoneProbDist object under the hood with the gamma argument set to 0.5, and it does:
    class LidstoneProbDist(ProbDistI):

"The Lidstone estimate for the probability distribution of the experiment used to generate a frequency distribution. The "Lidstone estimate" is parameterized by a real number gamma, which typically ranges from 0 to 1. The Lidstone estimate approximates the probability of a sample with count c from an experiment with N outcomes and B bins as (c + gamma) / (N + B*gamma). This is equivalent to adding gamma to the count for each bin, and taking the maximum likelihood estimate of the resulting frequency distribution."
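To make the quoted formula concrete, here is a minimal sketch (a toy example of my own, not NLTK's internal training code) that computes the Lidstone/ELE estimate by hand and checks it against nltk.probability.ELEProbDist:

    from nltk.probability import FreqDist, ELEProbDist

    # Toy data: for some (feature, class) pair, the feature value 1 was
    # observed 3 times and the value 0 was observed 7 times.
    fd = FreqDist({1: 3, 0: 7})

    c = fd[1]       # count of the sample we care about         -> 3
    N = fd.N()      # total number of observed outcomes         -> 10
    B = fd.B()      # number of bins (distinct sample values)   -> 2
    gamma = 0.5     # ELE is the Lidstone estimate with gamma = 0.5

    manual = (c + gamma) / (N + B * gamma)   # (3 + 0.5) / (10 + 2*0.5) ~= 0.318

    ele = ELEProbDist(fd)                    # bins defaults to the observed bins
    print(manual, ele.prob(1))               # both print ~0.318

As far as I can tell from the train() source, NLTK keeps one such frequency distribution per (label, feature name) pair and wraps it in an ELEProbDist, so $P(W \mid V_i)$ is exactly this smoothed estimate.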
How is this explanation represented as a step-by-step mathematical process?
Detailed explanation on Stack Overflow. NLTK Naive Bayes documentation. NLTK probability module documentation.
Sample Output:
    Most Informative Features
     outstanding = 1                 pos : neg    =     13.9 : 1.0
       insulting = 1                 neg : pos    =     13.7 : 1.0
      vulnerable = 1                 pos : neg    =     13.0 : 1.0
       ludicrous = 1                 neg : pos    =     12.6 : 1.0
     uninvolving = 1                 neg : pos    =     12.3 : 1.0
      astounding = 1                 pos : neg    =     11.7 : 1.0
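For context, here is a rough sketch of the kind of setup that produces output in this format. The movie_reviews corpus and the binary word-presence features are assumptions on my part; the NLTK calls that actually matter here are NaiveBayesClassifier.train() and show_most_informative_features():

    import random
    from nltk.corpus import movie_reviews          # assumes nltk.download('movie_reviews')
    from nltk.classify import NaiveBayesClassifier

    # Hypothetical vocabulary; a real run would use e.g. the most frequent corpus words.
    vocabulary = ['outstanding', 'insulting', 'vulnerable',
                  'ludicrous', 'uninvolving', 'astounding']

    def binary_features(words, vocabulary):
        present = set(words)
        # 1 if the vocabulary word occurs in the document, else 0
        return {w: int(w in present) for w in vocabulary}

    documents = [(list(movie_reviews.words(fid)), category)
                 for category in movie_reviews.categories()
                 for fid in movie_reviews.fileids(category)]
    random.shuffle(documents)

    train_set = [(binary_features(words, vocabulary), label) for words, label in documents]
    classifier = NaiveBayesClassifier.train(train_set)

    # Prints one "feature = value   label_a : label_b = ratio : 1.0" line per feature,
    # where the ratio is the max/min of the smoothed P(feature=value | label) values.
    classifier.show_most_informative_features(10)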