
I'm currently implementing a Gaussian Naive Bayes classifier. Of course, if I'm doing classification by

$$ \text{argmax}_{C_i} P(C_i)P(D|C_i), $$

then the probabilities can get very small. So I want to use log probabilities. I see three possibilities:

$$ \text{argmax}_{C_i} P(C_i)\log P(D|C_i), $$

$$ \text{argmax}_{C_i} \log P(C_i) \log P(D|C_i), $$

$$ \text{argmax}_{C_i} \log P(C_i) + \log P(D|C_i). $$

Which of them is the correct way to go? From a calculation point of view the second one seems right, because for the others I'm getting negative values, but from a math point of view the third one is right, due to the following:

$$ P(C_i|D) = \frac{P(C_i)P(D|C_i)}{P(D)} \propto P(C_i)P(D|C_i) $$

$$ \log P(C_i|D) \propto \log[P(C_i)P(D|C_i)] = \log P(C_i) + \log P(D|C_i) $$

P(D) can be dropped because it does not depend on the class. In any case, for all variants I'm getting values outside $[0,1]$, but I think this is OK because I'm calculating probability densities (from Gaussian distributions) and not probabilities.
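To make the third variant concrete, here is a minimal sketch of what I mean (the toy parameters are made up, and I'm assuming numpy/scipy for the Gaussian log-density):

```python
import numpy as np
from scipy.stats import norm

# Toy per-class parameters, as if estimated from training data:
# priors[c], means[c, j], stds[c, j] for class c and feature j.
priors = np.array([0.6, 0.4])
means = np.array([[0.0, 1.0],
                  [2.0, -1.0]])
stds = np.array([[1.0, 0.5],
                 [1.5, 1.0]])

def log_posterior_scores(x):
    """Unnormalized log P(C_i|D): log prior plus the sum of per-feature
    Gaussian log-densities (the naive independence assumption)."""
    return np.log(priors) + norm.logpdf(x, means, stds).sum(axis=1)

x = np.array([0.2, 0.8])
scores = log_posterior_scores(x)
print(scores, "-> predicted class:", scores.argmax())
```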

I have a second question. I'm also interested in the importance of each feature for each pair of classes. How could this be calculated based on Gaussian Naive Bayes? I need it because I want to visualize the 10 most important features for each pair of classes.

machinery

1 Answer


The third option is the right one. In general, it is true that $$ \log(ab) = \log(a) + \log(b). $$ Plugging this into the Naive Bayes equation, you get $$ \log P(\text{class}_i \mid \text{data}) \propto \log P(\text{class}_i) + \sum_j \log P(\text{data}_j \mid \text{class}_i). $$

This value may be negative. If all of your terms were actual probabilities, they'd be between zero and one, so the logs would all be between $-\infty$ and zero, as would their sum. In fact, you should be concerned if you see a positive log-probability. We often sashay around this fact by calculating the negative log-likelihood of something, which removes the minus sign.

This doesn't necessarily hold if you're throwing probability densities into the mix: densities can be larger than one, so their logs can be positive.
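A quick numeric illustration (scipy assumed, numbers purely illustrative): a narrow Gaussian has a peak density well above one, so its log-density is positive.

```python
from scipy.stats import norm

# Peak density of N(0, 0.1^2) is 1/(0.1 * sqrt(2*pi)) ~ 3.99 > 1,
# so the log-density at the mean is positive:
print(norm.pdf(0.0, loc=0.0, scale=0.1))     # ~3.989
print(norm.logpdf(0.0, loc=0.0, scale=0.1))  # ~1.384
```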


There are a few posts about determining variable importance in Naive Bayes (e.g., this one), so you may want to start there.
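If you want something quick to experiment with in the meantime, one simple heuristic (by no means the canonical answer) is to rank features by how much the two class-conditional Gaussians differ, e.g., by the symmetric KL divergence between $\mathcal{N}(\mu_{1j}, \sigma_{1j}^2)$ and $\mathcal{N}(\mu_{2j}, \sigma_{2j}^2)$ for each feature $j$. A sketch, assuming you already have the fitted per-class means and standard deviations as arrays (the function names are mine):

```python
import numpy as np

def gaussian_kl(mu1, s1, mu2, s2):
    """KL( N(mu1, s1^2) || N(mu2, s2^2) ), computed elementwise per feature."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def top_features(mu1, s1, mu2, s2, k=10):
    """Indices of the k features whose class-conditional Gaussians
    differ most between the two classes (symmetric KL divergence)."""
    score = gaussian_kl(mu1, s1, mu2, s2) + gaussian_kl(mu2, s2, mu1, s1)
    return np.argsort(score)[::-1][:k]
```

Features with well-separated means or badly mismatched variances float to the top; how well that tracks actual predictive importance is only a heuristic.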
Matt Krause
  • Hi, I am facing an issue with modeling the log-space probability for Naive Bayes. Your answer makes a lot of sense, but what brought me here is a mere accuracy rate of 24.44% with around 250k observations. It doesn't seem right to me. Could you please comment on the accuracy? I know it's not possible to say what went wrong from such minimal information, but any remark is highly appreciated. Thanks! – aradhak Oct 07 '16 at 00:29
  • Hmm...it's really hard to say and partly problem-dependent: if I were predicting winning lottery numbers 25% of the time, I'd be thrilled, but 25% on Fisher's Iris would be alarming. One thing to check would be the actual arithmetic. Floating-point calculations that mix many very big and very small values sometimes yield results that are surprisingly far from the "real" answer. I'd also check to make sure that none of the features were "blowing up" and assigning no/very low probability (or, in log-space, $-\infty$) to all classes. This can easily swamp the "good" features; see the sketch after these comments. – Matt Krause Oct 07 '16 at 16:17
  • Thanks a lot for the response. Now I think the accuracy shouldn't come as a surprise, since I am seeing a lot of −∞ in my predictions. I was too dumb to realize that! :D A bit about the dataset: using the 20newsgroup dataset from http://qwone.com/~jason/20Newsgroups, I am trying to predict the newsgroup to which any newly introduced document would belong. I am using R (which I am completely new to) and I suppose there is a huge chance that I am messing up the datatypes. Thanks for pointing in the right direction. I will work a bit more on the arithmetic part. – aradhak Oct 07 '16 at 19:07
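To make the $-\infty$ failure mode from these comments concrete: a single feature with a (near-)zero estimated variance can veto a class all by itself. A minimal sketch of the problem and of variance smoothing as one mitigation (scikit-learn's GaussianNB exposes this as its var_smoothing parameter; the numbers here are illustrative):

```python
import numpy as np
from scipy.stats import norm

# One feature whose estimated std is nearly zero for some class:
mu, sigma = 5.0, 1e-9
x = 6.0  # a test point one unit from that class's mean

# Its log-density is astronomically negative (about -5e17) and
# swamps every other feature's contribution to the class score:
print(norm.logpdf(x, mu, sigma))

# Variance smoothing: add a small epsilon to every variance so that
# no single feature can veto a class on its own.
eps = 1e-2  # illustrative; sklearn uses a fraction of the largest feature variance
print(norm.logpdf(x, mu, np.sqrt(sigma**2 + eps)))  # about -48.6
```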