Given a set of features $x_1, x_2, x_3, \dots \in \mathbb{R}$ and an output class variable $y$ taking values in a discrete set of classes,
I could apply Naive Bayes, using the assumption that $x_1, x_2, x_3, \dots$ are conditionally independent given $y$, to predict the class posterior as:
$$P(y \mid x_1, x_2, x_3, \dots) = \frac{P(y)\, P(x_1 \mid y)\, P(x_2 \mid y)\, P(x_3 \mid y) \cdots}{P(x_1)\, P(x_2)\, P(x_3) \cdots}$$
where I could build 1-D histograms from my data to estimate each of the densities on the right-hand side.
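For concreteness, here is a rough sketch (in Python) of what I mean by this first approach. The function names, the bin count, and the assumption that `X` is an array of shape `(n_samples, n_features)` with labels `y` are all just illustrative choices on my part:

```python
import numpy as np

def fit_histogram_nb(X, y, n_bins=20):
    """Fit class priors and per-class, per-feature 1-D histogram densities."""
    classes = np.unique(y)
    model = {"classes": classes, "priors": {}, "hists": {}}
    for c in classes:
        Xc = X[y == c]
        model["priors"][c] = len(Xc) / len(X)
        # One 1-D histogram per feature: this is where the naive
        # (conditional) independence assumption enters.
        model["hists"][c] = [
            np.histogram(Xc[:, j], bins=n_bins,
                         range=(X[:, j].min(), X[:, j].max()), density=True)
            for j in range(X.shape[1])
        ]
    return model

def predict_proba_nb(model, x, eps=1e-12):
    """Posterior over classes for a single sample x."""
    log_scores = []
    for c in model["classes"]:
        log_p = np.log(model["priors"][c])
        for j, (density, edges) in enumerate(model["hists"][c]):
            # Find the bin that x[j] falls into and use its density value.
            k = np.clip(np.searchsorted(edges, x[j]) - 1, 0, len(density) - 1)
            log_p += np.log(density[k] + eps)
        log_scores.append(log_p)
    # The denominator does not depend on the class, so normalising the
    # numerators over the classes gives the posterior.
    log_scores = np.array(log_scores)
    probs = np.exp(log_scores - log_scores.max())
    return dict(zip(model["classes"], probs / probs.sum()))
```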
Or I could instead treat $X = [x_1, x_2, x_3, \dots]$ as an $n$-dimensional variable and use:
$$P(y \mid X) = \frac{P(y)\, P(X \mid y)}{P(X)}$$
where I could use something like multivariate kernel density estimation (KDE) to estimate the right-hand-side densities $P(X \mid y)$ and $P(X)$.
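Again for concreteness, here is a sketch of what I have in mind for this second approach, using `scipy.stats.gaussian_kde` for the joint class-conditional densities; the function names are mine, and the bandwidth is just whatever `gaussian_kde` picks by default:

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_kde_bayes(X, y):
    """Fit class priors and one joint (multivariate) KDE per class."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    # gaussian_kde expects data of shape (n_features, n_samples),
    # and models the joint density of all features at once.
    kdes = {c: gaussian_kde(X[y == c].T) for c in classes}
    return classes, priors, kdes

def predict_proba_kde(classes, priors, kdes, x):
    """Posterior over classes for a single sample x via Bayes' rule."""
    # Numerators P(y) * P(x | y); P(x) is their sum over the classes.
    num = np.array([priors[c] * kdes[c](x)[0] for c in classes])
    return dict(zip(classes, num / num.sum()))
```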
It looks like I do not need to assume independence of $x_1, x_2, x_3, \dots$ for the second approach. Is that correct, or is the independence assumption somehow absorbed into the KDE process? If not, can the latter still be called Naive Bayes, or would it just be Bayes?
Also, what would be the pros and cons of each approach for computing the posterior probability of the class $y$?