The Gaussian assumption is just that -- an assumption. If it holds, great: you get a low-complexity estimate of the joint pdf. If it does not hold, you will have to use a more complex estimator, such as a nonparametric one.
For example, suppose that you want to approximate a joint pdf over two variables $(X,Y)$, i.e., $p(x,y)$. If you cannot make any assumption about the pdf, then the best you can do is use a multivariate KDE or some other nonparametric estimator to approximate the true $p(x,y)$. If you can assume that $x$ and $y$ are independent (i.e., $p(x|y)=p(x)$), then you have $p(x,y)=p(x|y)p(y)=p(x)p(y)$. This means that you can estimate $p(x)$ and $p(y)$ separately with some estimator and just multiply them. If these two marginals are multimodal, then you probably still need a KDE for each. However, if you can assume that each marginal is well approximated by a single Gaussian, then you can approximate your joint pdf as $p(x,y)\approx \mathcal{G}(x| \mu_x, \sigma_x)\, \mathcal{G}(y| \mu_y, \sigma_y)$, where $\mathcal{G}(x| \mu_x, \sigma_x)$ and $\mathcal{G}(y| \mu_y, \sigma_y)$ are just 1D Gaussians.
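To make this concrete, here is a minimal sketch in Python (with made-up sample data; the parameter values are placeholders, not from your problem) comparing the two options: a full 2D KDE of $p(x,y)$ versus the product of two fitted 1D Gaussians under the independence assumption:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

# Made-up sample data: n draws of (x, y) from some "unknown" joint pdf.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(loc=1.0, scale=2.0, size=n)
y = rng.normal(loc=-1.0, scale=0.5, size=n)

# Option 1: no assumptions -- multivariate KDE of p(x, y).
# gaussian_kde expects the data as an array of shape (n_dims, n_samples).
kde_joint = gaussian_kde(np.vstack([x, y]))

# Option 2: assume independence AND that each marginal is one Gaussian.
mu_x, sigma_x = norm.fit(x)  # ML estimates of mean and std of x
mu_y, sigma_y = norm.fit(y)  # same for y

# Evaluate both approximations of p(x, y) at an arbitrary query point.
query = np.array([0.5, -0.8])
p_kde = kde_joint(query.reshape(2, 1))[0]
p_gauss = norm.pdf(query[0], mu_x, sigma_x) * norm.pdf(query[1], mu_y, sigma_y)
print(p_kde, p_gauss)
```

Since the toy data here really are independent Gaussians, the two numbers come out close; on multimodal or dependent data, the product-of-Gaussians estimate would be the one that breaks down.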
Now suppose you have many more than two variables, say 10000. You probably will not want to estimate a joint KDE over these (although you might after reducing the dimensionality with PCA). Often the only computationally feasible solution is to assume that all variables are independent and that the pdf along each variable is well approximated by a single Gaussian. So that's the motivation for the extreme simplification you were talking about. Hope this answers your question.
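For completeness, here's what that fully factorized Gaussian model looks like at such a scale (again with placeholder data and an assumed sample size): the whole model is just one mean and one standard deviation per dimension, and the joint log-density is a sum of 1D Gaussian log-pdfs:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, d = 500, 10_000           # assumed: 500 samples, 10000 variables
X = rng.normal(size=(n, d))  # placeholder data matrix

# Fully factorized Gaussian: only 2*d parameters in total,
# instead of an intractable d-dimensional density estimate.
mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)

def log_pdf(z):
    """log p(z) under the product-of-1D-Gaussians model."""
    return norm.logpdf(z, mu, sigma).sum()

print(log_pdf(X[0]))
```

Working in log space is essential here: a product of 10000 small densities underflows to zero in floating point, while the sum of log-pdfs stays well behaved.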