In our machine learning class, we were given an example of a naive Bayes classifier. Say you classify a day as good/bad depending on two conditions (the "X" input): weather ($X_1$: hot/cold) and wind ($X_2$: high/low). Using Bayes' theorem and the naive (conditional independence) assumption: $$P(Y = \text{good} \mid X_1, X_2) = \frac{P(X_1, X_2 \mid Y)\,P(Y)}{P(X_1, X_2)} = \frac{P(X_1 \mid Y)\,P(X_2 \mid Y)\,P(Y)}{P(X_1, X_2)}$$
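For concreteness, here is roughly how I picture the classifier being estimated from counts (a toy sketch with made-up data, just to fix ideas):

```python
from collections import Counter

# Toy training data: (weather, wind, label) -- made-up values just to fix ideas
days = [
    ("hot", "low", "good"), ("hot", "high", "bad"),
    ("cold", "low", "good"), ("cold", "high", "bad"),
    ("hot", "low", "good"), ("cold", "high", "good"),
]

label_counts = Counter(y for _, _, y in days)

def p_label(y):
    # P(Y) estimated as the relative frequency of the label
    return label_counts[y] / len(days)

def p_feature_given_label(index, value, y):
    # P(X_i = value | Y = y) estimated from counts within class y
    in_class = [d for d in days if d[2] == y]
    return sum(d[index] == value for d in in_class) / len(in_class)

def score(y, weather, wind):
    # Numerator of Bayes' theorem under the naive assumption:
    # P(X1|Y) * P(X2|Y) * P(Y); the denominator P(X1, X2) is the same
    # for every label, so it cancels when comparing labels
    return (p_feature_given_label(0, weather, y)
            * p_feature_given_label(1, wind, y)
            * p_label(y))

print(max(["good", "bad"], key=lambda y: score(y, "hot", "low")))
```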
Now, assuming $X \mid Y$ follows a binomial distribution, we're told the conjugate prior is a Beta distribution. However, isn't this prior a distribution over $X$ itself and not over $Y$? $Y$ here is just a category, so how does $P(Y)$ denote a prior distribution for $X$? I understand this in the coin-flip case, where we talk about a parameter $\theta = P(\text{head})$ and how it follows that $$P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$$
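Concretely, assuming a $\mathrm{Beta}(\alpha, \beta)$ prior and data $D$ with $h$ heads in $n$ flips: $$P(\theta \mid D) \propto \underbrace{\theta^{h}(1-\theta)^{n-h}}_{P(D \mid \theta)} \cdot \underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{P(\theta)} = \theta^{h+\alpha-1}(1-\theta)^{n-h+\beta-1},$$ which is the kernel of a $\mathrm{Beta}(\alpha+h,\ \beta+n-h)$ distribution.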
So we get an overall Beta distribution. However, $Y$ is not a parameter of $X$, so how does its distribution allow us to get MAP estimates? I'd be grateful if someone could explain this to me.
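For reference, in the coin-flip case I would maximize the posterior above (with the same $\mathrm{Beta}(\alpha, \beta)$ prior) to get $$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D) = \frac{h + \alpha - 1}{n + \alpha + \beta - 2},$$ but I don't see what the analogous parameter would be when the "prior" is $P(Y)$.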