Actually this is pretty simple: the Bayes classifier chooses the class that has the greatest a posteriori probability of occurrence (so-called maximum a posteriori estimation). The 0-1 loss function penalizes misclassification, i.e. it assigns the smallest loss to the solution that has the greatest number of correct classifications. So in both cases we are talking about estimating the mode. Recall that the mode is the most common value in the dataset, or the most probable value, so both maximizing the posterior probability and minimizing the 0-1 loss lead to estimating the mode.
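As a quick made-up illustration (the numbers are arbitrary; $s$ denotes the class and $\boldsymbol{x}$ the data, as in the quote below): suppose

$$ P(s = 1 \mid \boldsymbol{x}) = 0.2, \quad P(s = 2 \mid \boldsymbol{x}) = 0.5, \quad P(s = 3 \mid \boldsymbol{x}) = 0.3 . $$

Predicting class 2, the mode, is wrong with probability $0.5$; predicting class 1 or class 3 is wrong with probability $0.8$ or $0.7$. So picking the mode gives the smallest expected 0-1 loss.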
If you need a formal proof, one is given in the Introduction to Bayesian Decision Theory paper by Angela J. Yu:
The 0-1 binary loss function has the following form:
$$ l_\boldsymbol{x}(\hat s, s^*) = 1 - \delta_{\hat ss^*} =
\begin{cases} 1 & \text{if} \quad \hat s \ne s^* \\ 0 &
\text{otherwise} \end{cases} $$
where $\delta$ is the Kronecker Delta function. (...) the expected loss
is:
$$ \begin{align} \mathcal{L}_\boldsymbol{x}(\hat s) &= \sum_{s^*}
l_\boldsymbol{x}(\hat s, s^*) \; P(s = s^* \mid \boldsymbol{x}) \\ &=
\sum_{s^*} (1 - \delta_{\hat ss^*}) \; P(s = s^* \mid \boldsymbol{x})
\\ &= \sum_{s^*} P(s = s^* \mid \boldsymbol{x}) - \sum_{s^*} \delta_{\hat ss^*} \; P(s = s^* \mid \boldsymbol{x}) \\ &= 1 - P(s = \hat s
\mid \boldsymbol{x}) \end{align} $$
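If it helps, you can check this calculation numerically. Below is a minimal sketch (the posterior values are made up) that computes the expected loss of each possible prediction straight from the definition and confirms that it equals $1 - P(s = \hat s \mid \boldsymbol{x})$, so the minimizer is the posterior mode:

```python
import numpy as np

# Made-up posterior P(s = s* | x) over four classes (numbers are arbitrary).
posterior = np.array([0.1, 0.4, 0.3, 0.2])

def expected_loss(s_hat):
    """Expected 0-1 loss of predicting s_hat, straight from the definition:
    sum over s* of (1 - delta(s_hat, s*)) * P(s = s* | x)."""
    delta = np.zeros_like(posterior)
    delta[s_hat] = 1.0                     # Kronecker delta
    return np.sum((1.0 - delta) * posterior)

losses = np.array([expected_loss(s) for s in range(len(posterior))])

print(losses)                                      # [0.9 0.6 0.7 0.8]
print(np.allclose(losses, 1 - posterior))          # True: matches 1 - P(s = s_hat | x)
print(np.argmin(losses) == np.argmax(posterior))   # True: the minimizer is the mode
```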
This result holds for maximum a posteriori estimation in general. So if you know the posterior distribution, then under 0-1 loss the optimal classification rule is to take the mode of the posterior distribution; we call this the optimal Bayes classifier. In real life we usually do not know the posterior distribution, but rather estimate it. The naive Bayes classifier approximates the optimal classifier by looking at the empirical distribution and by assuming independence of the predictors. So the naive Bayes classifier is not itself optimal, but it approximates the optimal solution. In your question you seem to confuse those two things.
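To illustrate that last point, here is a rough simulation sketch. Everything in it is an assumption made up for illustration: the Gaussian generative model, the class means, the correlated covariance, and the hand-rolled naive Bayes fit. It builds the optimal Bayes classifier from the true posterior and a naive Bayes classifier from a sample, then compares their accuracies on the same test data:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Made-up generative model (purely an assumption, for illustration) ------
# Two equally likely classes; the two features are strongly correlated,
# so the naive independence assumption is deliberately violated.
means = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 0.0])}
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])              # shared, correlated covariance
cov_inv = np.linalg.inv(cov)
cov_det = np.linalg.det(cov)

def true_density(x, k):
    """True class-conditional density p(x | s = k) of the assumed model."""
    d = x - means[k]
    quad = np.einsum('...i,ij,...j->...', d, cov_inv, d)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(cov_det))

def sample(n):
    """Draw labels with equal priors, then features from the true model."""
    y = rng.integers(0, 2, size=n)
    x = np.empty((n, 2))
    for k in (0, 1):
        idx = (y == k)
        x[idx] = rng.multivariate_normal(means[k], cov, size=idx.sum())
    return x, y

def bayes_optimal(x):
    """Optimal Bayes classifier: the mode of the true posterior.
    Equal priors cancel, so compare the class-conditional densities."""
    return (true_density(x, 1) > true_density(x, 0)).astype(int)

def fit_naive_bayes(x, y):
    """Empirical prior and per-feature Gaussians per class; correlations ignored."""
    return {k: (np.mean(y == k), x[y == k].mean(axis=0), x[y == k].var(axis=0))
            for k in (0, 1)}

def predict_naive_bayes(params, x):
    """Pick the class with the larger log prior + sum of per-feature log likelihoods."""
    scores = []
    for k in (0, 1):
        prior, mu, var = params[k]
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores.append(np.log(prior) + log_lik.sum(axis=1))
    return (scores[1] > scores[0]).astype(int)

x_train, y_train = sample(2000)
x_test, y_test = sample(20000)

nb = fit_naive_bayes(x_train, y_train)
print("optimal Bayes accuracy:", np.mean(bayes_optimal(x_test) == y_test))
print("naive Bayes accuracy:  ", np.mean(predict_naive_bayes(nb, x_test) == y_test))
```

Because the features here are correlated, the independence assumption is wrong and naive Bayes stays below the optimal accuracy; if the features really were conditionally independent given the class, the two classifiers would roughly coincide as the training sample grows.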