What is the best form (Gaussian, Multinomial) of Naive Bayes to use with categorical (one-hot encoded) features?

Question

I've been asked to use the Naive Bayes classifier to classify a couple of samples.

My dataset had categorical features, so I first had to encode them using a one-hot encoder, but then I was at a loss as to which statistical model to use (e.g. Gaussian NB, Multinomial NB).

I ended up using the multinomial version because I read somewhere that it worked well in NLP and IR tasks due to documents being represented as term-count vectors or TF-IDF weights.

I would like to know if that was correct and, if possible, a quick explanation on why that is so.

PS There is this somewhat similar question, but I'm not sure whether that also applies to strictly binary (0 or 1) feature vectors.

Naive Bayes seems like a strange choice to me because it assumes independence between features. But, the features of a one-hot encoding are heavily dependent because only one of them can be nonzero. In this sense, one-hot encodings are different than word count vectors. — user20160, May 29 '16 at 06:01
@user20160 I agree, but how else would you encode categorical (i.e. words) features in order to use them in a Naive Bayes classifier, supposing you need to use that classifier? — Felipe, May 29 '16 at 21:53

Michael · Answer 1 · 2018-09-22T21:26:41.570

As others mentioned, there isn't a "right" model. However, since you used one-hot encoding, you are basically dealing with boolean features now. In other words, each term/feature follows a Bernoulli distribution. Given that, I would use a multivariate Bernoulli NB or a multinomial NB with boolean features (which you already have). Gaussian NB seems a bit off here since you aren't dealing with real-valued features.
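To make the difference between the two suggested variants concrete, here is a minimal sketch of the class-conditional log-likelihood each one assigns to a binary feature vector. The probabilities and the three-level "color" feature are made up for illustration, and the function names are hypothetical (this is not any library's API). The main point is that the Bernoulli model also scores the columns that are 0, while the multinomial model only scores the columns that are nonzero.

```python
import math


def bernoulli_log_likelihood(x, p):
    """log P(x | class) when binary feature j is Bernoulli(p[j])."""
    return sum(
        math.log(pj) if xj == 1 else math.log(1.0 - pj)
        for xj, pj in zip(x, p)
    )


def multinomial_log_likelihood(x, q):
    """log P(x | class) up to a class-independent constant; q sums to 1."""
    return sum(xj * math.log(qj) for xj, qj in zip(x, q) if xj > 0)


# Hypothetical one-hot encoding of one feature with levels (red, green, blue).
x = [0, 1, 0]        # this sample is "green"
p = [0.2, 0.7, 0.1]  # made-up per-column probabilities for one class

# Bernoulli: log(0.8) + log(0.7) + log(0.9), absent columns contribute too.
print(bernoulli_log_likelihood(x, p))
# Multinomial: log(0.7), only the active column contributes.
print(multinomial_log_likelihood(x, p))
```

With one-hot columns the zeros carry no extra information (they are implied by the single 1), so the Bernoulli model effectively counts the same evidence more than once. Whether that hurts in practice is an empirical question.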

This excellent paper has a lot of information on different NB variants and when to use which.

redress · Answer 2 · 2017-05-24T01:06:46.387

Your choice of statistical model in classification (Gaussian NB, Multinomial NB, etc.) depends on the distribution of your input variables. You should plot a histogram of each input variable to get a sense of its distribution.

You can use Pandas to do this by creating a dataframe on your input matrix and running .hist() on it, as follows:

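import pandas as pd
# natural_index(dataset) is a helper from the answerer's own code, not part of pandas; drop index= to use the default integer index.
X_frame = pd.DataFrame(X, index=natural_index(dataset))
X_frame.hist()  # draws one histogram per column (requires matplotlib)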

score 0 · Answer 3 · answered May 24 '17 at 02:21

If you're using real-world data, it's very unlikely that any model will be "right," so rather than try to find a model that is "right," you should try to find a model that is accurate. To decide between those two models, you can use cross validation to get an estimate of the accuracy of each model and choose the better one. At the end of the day, you can't be confident about which model will perform best on your data without actually running the models on your data in some capacity, even if one model is used in similar applications.
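A minimal sketch of that comparison, assuming scikit-learn is available. The data here is synthetic (three made-up categorical features with four levels each, and a noisy label tied to the first feature), so the numbers it prints say nothing about your dataset; substitute your own `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in data: 300 samples, 3 categorical features, 4 levels each.
n_samples = 300
X_cat = rng.integers(0, 4, size=(n_samples, 3))

# The label depends on the first feature, with 10% of labels flipped as noise.
y = (X_cat[:, 0] >= 2).astype(int)
flip = rng.random(n_samples) < 0.1
y[flip] = 1 - y[flip]

# Dense one-hot matrix (GaussianNB does not accept sparse input).
X = OneHotEncoder().fit_transform(X_cat).toarray()

models = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "GaussianNB": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")

best = max(scores, key=scores.get)
print("best:", best)
```

Accuracy is only one possible criterion; with imbalanced classes a different `scoring` value may be more appropriate.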

I would also suggest a third Naive Bayes model that you could try. Instead of using a one-hot encoder, let the class-conditional density of each feature be a categorical distribution.

More precisely, suppose $Y_i \in \{1, ..., C\}$ is the label for data point $i$. Suppose $X_i$ is the data for data point $i$ and suppose that each feature is $X_{ij} \in \{1, ..., K\}$. In other words, suppose that each feature is categorical with $K$ values. You could use the model $P(X_{ij} = k|Y_{i} = c, \theta) = \theta_{cjk}$ where $\forall c \forall j$, $\sum_{k=1}^K \theta_{cjk} = 1$.
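The model above can be sketched in a few lines of plain Python. This is an illustration rather than a reference implementation: the function names are hypothetical, categories are coded $0, ..., K-1$ instead of $1, ..., K$, and the Laplace smoothing term `alpha` is an added assumption (not part of the formula above) that keeps every $\theta_{cjk}$ strictly positive.

```python
import math
from collections import Counter


def fit_categorical_nb(X, y, n_values, alpha=1.0):
    """Estimate priors P(Y = c) and theta[c][j][k] = P(X_j = k | Y = c).

    X is a list of rows of integer category codes in 0..n_values-1.
    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    classes = sorted(set(y))
    n_features = len(X[0])
    class_counts = Counter(y)

    # counts[c][j][k] = number of class-c rows whose feature j equals k
    counts = {c: [[0] * n_values for _ in range(n_features)] for c in classes}
    for row, c in zip(X, y):
        for j, k in enumerate(row):
            counts[c][j][k] += 1

    priors = {c: class_counts[c] / len(y) for c in classes}
    theta = {
        c: [
            [
                (counts[c][j][k] + alpha) / (class_counts[c] + alpha * n_values)
                for k in range(n_values)
            ]
            for j in range(n_features)
        ]
        for c in classes
    }
    return priors, theta


def predict(row, priors, theta):
    """Return the class maximizing log P(c) + sum_j log theta[c][j][row[j]]."""
    def log_posterior(c):
        return math.log(priors[c]) + sum(
            math.log(theta[c][j][k]) for j, k in enumerate(row)
        )
    return max(priors, key=log_posterior)


# Toy data: 2 features, each with K = 3 values coded 0..2.
X = [[0, 1], [0, 2], [1, 1], [2, 0], [2, 2], [2, 0]]
y = ["a", "a", "a", "b", "b", "b"]

priors, theta = fit_categorical_nb(X, y, n_values=3)
print(predict([0, 1], priors, theta))
print(predict([2, 0], priors, theta))
```

Because each feature gets a single distribution over its $K$ values, the constraint $\sum_{k=1}^K \theta_{cjk} = 1$ holds by construction, which is what the one-hot Bernoulli formulation does not enforce.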

score 0 · Answer 4 · answered Oct 09 '18 at 16:48

I would suggest plotting a histogram. For a quick histogram you can do this:

Load the data into a pandas DataFrame: df = pandas.DataFrame(data, optional parameters)

df.hist()

If most of your features follow a Bernoulli distribution, you should be fine using Bernoulli (or Multinomial) NB, and if they follow a Gaussian (normal) distribution, Gaussian NB should be a good fit.

If the distributions of your features look complex (a mixture of different distributions), consider dimensionality reduction so that most, though not necessarily all, of the remaining features have similar distributions.
