
I'm working with the Multinomial and Bernoulli Naive Bayes implementations of scikit-learn (Python) for text classification, using the 20_newsgroups dataset. From the scikit-learn documentation we have:

class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

and

class sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)

so we need to give a float value to alpha, which represents the smoothing parameter (as scikit says: "setting alpha = 1 is called Laplace smoothing, while alpha < 1 is called Lidstone smoothing").

Now, I noticed the Multinomial version works pretty well with Laplace smoothing (alpha=1.0), while the Bernoulli one seems pretty bad with that value. I tried different values for the Bernoulli alpha and noticed Bernoulli had acceptable accuracy when alpha was something like 0.01, 0.03, 0.001, etc. So I thought Bernoulli Naive Bayes "prefers" Lidstone smoothing. Now, my question is: is it always like that (alpha=1.0 for Multinomial and alpha=0.01 for Bernoulli), or is the best value of the smoothing parameter related to the particular structure of the dataset we're using?
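The comparison described above can be made systematic by treating alpha as a hyper-parameter and cross-validating it. Here is a minimal sketch; the tiny toy corpus and the alpha grid are invented for illustration (in practice you would load the real data with sklearn.datasets.fetch_20newsgroups):

```python
# Sketch: tune alpha by cross-validation instead of fixing it a priori.
# The toy corpus below stands in for 20_newsgroups (assumption for brevity).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["the rocket launch was delayed",
        "nasa tested a new rocket engine",
        "the car engine needs new oil",
        "he bought a used car yesterday"] * 5
labels = [0, 0, 1, 1] * 5

for clf in (MultinomialNB(), BernoulliNB()):
    pipe = Pipeline([("vect", CountVectorizer()), ("nb", clf)])
    grid = GridSearchCV(pipe,
                        {"nb__alpha": [1.0, 0.1, 0.03, 0.01, 0.001]},
                        cv=5)
    grid.fit(docs, labels)
    print(type(clf).__name__, "best alpha:", grid.best_params_["nb__alpha"])
```

Whichever alpha wins for each model on your data answers the question empirically for that dataset.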

Clearly the smoothing parameter is a hyper-parameter, and you tune it to its best value for the task at hand. So yes, it is related to the dataset. – naive Apr 03 '19 at 18:34

1 Answer


It's not related to the "structure"; it's related to how certain you are that the relative count for a given case in your data is a correct estimate of its probability. (By "relative count" I mean the rate: number of occurrences divided by the total number of examples in the dataset.)

Consider a dataset with a few features and a label that is "positive" or "negative". Say the positives and negatives are split 50/50, and one of the features, called F, is 0 for every "positive" example. What is the probability of getting a positive, given that F=1?

P("positive" | F=1)  =  P(F=1 | "positive") * P("positive") / P(F=1)

Given that there are no positives in your dataset with F=1, you can set P(F=1 | "positive") to 0, which forces the whole posterior to 0. Or you can argue that your dataset is finite, and that the real probability is greater than 0. If you believe the latter, you should set alpha > 0.
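The effect is easy to see numerically. The sketch below uses the standard Lidstone/Laplace estimate (count + alpha) / (total + alpha * n_values) with invented counts, just to show how alpha pulls a zero count above 0:

```python
# Invented counts: 50 positive examples, none of them with F=1.
def smoothed_prob(count_f1_pos, count_pos, alpha, n_values=2):
    # Lidstone/Laplace estimate for a binary feature (n_values=2):
    # (count + alpha) / (total + alpha * n_values)
    return (count_f1_pos + alpha) / (count_pos + alpha * n_values)

print(smoothed_prob(0, 50, alpha=0.0))   # 0.0 -- zero wipes out the posterior
print(smoothed_prob(0, 50, alpha=1.0))   # ~0.0192, Laplace smoothing
print(smoothed_prob(0, 50, alpha=0.01))  # ~0.0002, Lidstone smoothing
```

With alpha=0 the estimate is exactly 0 and P("positive" | F=1) collapses; any alpha > 0 keeps it strictly positive, and the size of alpha controls how far the estimate is pulled away from the raw relative count.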

Paul