4

Squared loss for linear regression corresponds to MLE of a Gaussian model, and cross-entropy loss corresponds to MLE of a logistic model with discrete probabilities.
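To spell out the correspondences I mean (under the usual assumptions of i.i.d. data, fixed noise variance $\sigma^2$ in the Gaussian case, and a Bernoulli likelihood in the binary case): $$ -\log p(y \mid x, \theta) = \frac{(y - \theta^\top x)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2), $$ so maximizing the Gaussian likelihood over $\theta$ minimizes the squared error, and for $y \in \{0, 1\}$ with $p_\theta(x) = 1/(1 + e^{-\theta^\top x})$, $$ -\log p(y \mid x, \theta) = -\bigl[\, y \log p_\theta(x) + (1 - y)\log(1 - p_\theta(x)) \,\bigr], $$ which is exactly the cross-entropy loss.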

Can zero-one loss be justified by MLE of any probability model? Why or why not?

That's the main question. What follows has more to do with intuition for zero-one loss than with MLE: I'd like to state some things I believe are true about zero-one loss, and please correct me if I'm wrong:

  • MAP is Bayes optimal for zero-one loss
  • Logistic regression models probabilities directly, and classifies to the largest, but its parameters are optimized based on maximum likelihood (leading to cross-entropy) rather than optimizing zero-one loss directly
  • The performance of logistic regression could be improved by using zero-one loss, or you could ditch logistic regression entirely and use frequency tables to approximate $P(\text{class} | X)$, though both of these options are entirely intractable
Drew N
    Can you explain why you would think logistic regression could be improved using zero-one loss? – Cliff AB Apr 24 '19 at 23:03
  • I believe the performance of logistic regression *as a classifier* would improve if we optimize by zero-one loss, since its performance as a classifier is measured using zero-one loss. – Drew N Apr 25 '19 at 04:23
  • Zero-one loss is usually too hard to optimize, which is why surrogate losses are always used for training. – Frans Rodenburg Apr 29 '19 at 13:54

2 Answers

3

Minimizing the zero-one loss $$ L(\theta, \widehat{\theta}) = 1 - \mathbf{1}_{\{\theta\}}(\widehat{\theta}) = \begin{cases} 0 & \text{if $\theta = \widehat{\theta}$} \\ 1 & \text{otherwise} \end{cases} $$ is the same as maximum likelihood estimation with a family of Dirac distributions: discrete distributions with probability mass function $$ p(x \mid \theta) = \mathbf{1}_{\{\theta\}}(x) = \begin{cases} 1 & \text{if $x = \theta$} \\ 0 & \text{otherwise.} \end{cases} $$ Here $\mathbf{1}_A$ denotes the indicator function of a set $A$.
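To spell the equivalence out: for an observation $x$, $$ \operatorname*{arg\,max}_{\widehat{\theta}} \, p(x \mid \widehat{\theta}) = \operatorname*{arg\,max}_{\widehat{\theta}} \, \mathbf{1}_{\{\widehat{\theta}\}}(x) = \operatorname*{arg\,min}_{\widehat{\theta}} \, \bigl(1 - \mathbf{1}_{\{x\}}(\widehat{\theta})\bigr) = \operatorname*{arg\,min}_{\widehat{\theta}} \, L(x, \widehat{\theta}), $$ so maximizing the Dirac likelihood over $\widehat{\theta}$ is exactly minimizing the zero-one loss between $\widehat{\theta}$ and the observation.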


Regarding your beliefs about zero-one loss:

MAP is Bayes optimal for zero-one loss

Consider a parametric family of distributions $\{P_\theta : \theta \in \Omega\}$ and a prior $\pi$. The risk of any estimator $\widehat{\theta}$ of $\theta$ is $$ \begin{aligned} R(\theta, \widehat{\theta}) &= E_\theta[L(\theta, \widehat{\theta})] \\ &= E_\theta[1 - \mathbf{1}_{\{\theta\}}(\widehat{\theta})] \\ &= P_\theta(\widehat{\theta} \neq \theta), \end{aligned} $$ and so the Bayes risk is $$ \begin{aligned} R_\pi(\widehat{\theta}) &= \int_\Omega R(\theta, \widehat{\theta}) \, d\pi(\theta) \\ &= \int_\Omega P_\theta(\widehat{\theta} \neq \theta) \, d\pi(\theta). \end{aligned} $$ A priori it is not clear to me why a MAP estimator $\widehat{\theta}$ would minimize this Bayes risk in this general setup.
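For what it's worth, the claim can be verified directly in the special case of a countable parameter space $\Omega$: the posterior expected loss is $$ E[L(\theta, \widehat{\theta}) \mid x] = \sum_{\theta \in \Omega} \bigl(1 - \mathbf{1}_{\{\theta\}}(\widehat{\theta})\bigr) \, \pi(\theta \mid x) = 1 - \pi(\widehat{\theta} \mid x), $$ which is minimized by taking $\widehat{\theta}$ to be a mode of the posterior, i.e. a MAP estimate. When $\Omega$ is continuous, the posterior probability of any single point is typically zero, and this argument no longer applies.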

Logistic regression models probabilities directly, and classifies to the largest, but its parameters are optimized based on maximum likelihood (leading to cross-entropy) rather than optimizing zero-one loss directly

This is correct.

The performance of logistic regression could be improved by using zero-one loss, or you could ditch logistic regression entirely and use frequency tables to approximate P(class|X), though both of these options are entirely intractable

Zero-one loss is neither differentiable nor convex, so the traditional optimization methods used to train models like logistic regression don't apply to it. The problem with frequency tables is that as soon as you encounter a point X that isn't in the training data, you can't make any inference about P(class|X), as you would like to.
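As a rough numerical illustration (a minimal sketch with simulated data; the one-weight model and the data here are made up for the example), the zero-one loss of a linear classifier is piecewise constant in the weights, so gradient-based optimizers get no signal from it, while the cross-entropy loss changes smoothly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D classification data (made up for the example).
X = rng.normal(size=100)
y = (X + rng.normal(scale=1.0, size=100) > 0).astype(int)

def zero_one_loss(w):
    """Misclassification rate of the linear classifier 1{w * x > 0}."""
    preds = (w * X > 0).astype(int)
    return np.mean(preds != y)

def cross_entropy_loss(w):
    """Average negative log-likelihood of the logistic model p = 1 / (1 + exp(-w * x))."""
    p = 1.0 / (1.0 + np.exp(-w * X))
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Sweeping the single weight: the zero-one loss is a step function of w
# (flat almost everywhere, so it provides no gradient signal), while the
# cross-entropy loss varies smoothly with w.
for w in [-1.0, 0.1, 0.5, 1.0, 2.0, 5.0]:
    print(f"w = {w:+5.1f}   zero-one = {zero_one_loss(w):.3f}   "
          f"cross-entropy = {cross_entropy_loss(w):.3f}")
```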

Artem Mavrin
  • That's a good point about frequency tables, and it also applies to Naive Bayes. What they do to avoid zero probabilities there is add a pseudocount to both the numerator and denominator, called "additive smoothing". – Drew N Apr 25 '19 at 22:02
  • I looked over some of my reading and I can answer why MAP yields the Bayes decision rule in a prediction problem. Your answer is phrased for estimation problems; I didn't realize at the time my question could be interpreted both ways. I'll post my own answer tying those together later. – Drew N Apr 25 '19 at 22:07
2

In general, surrogate losses used in place of 0-1 loss are chosen to be Fisher consistent. There are established results on this matter:

http://statistics.berkeley.edu/sites/default/files/tech-reports/638.pdf

This means that, in the population limit, minimizing the surrogate loss also yields a classifier that minimizes the expected 0-1 loss.
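A small numerical check of this (a sketch of my own, not taken from the linked paper; the probability values and the grid are arbitrary): for any conditional probability $p = P(Y = 1 \mid x)$, the score that minimizes the conditional logistic (cross-entropy) risk has the same sign as the Bayes rule for 0-1 loss:

```python
import numpy as np

# Conditional logistic (cross-entropy) risk at a single point x, as a function
# of the real-valued score f, when P(Y = 1 | x) = p:
#   R(f; p) = p * log(1 + exp(-f)) + (1 - p) * log(1 + exp(f))
def conditional_logistic_risk(p, f):
    return p * np.log1p(np.exp(-f)) + (1 - p) * np.log1p(np.exp(f))

f_grid = np.linspace(-10, 10, 4001)  # crude grid search over scores

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    f_star = f_grid[np.argmin(conditional_logistic_risk(p, f_grid))]
    bayes_sign = np.sign(p - 0.5)  # Bayes rule for 0-1 loss: predict 1 iff p > 1/2
    print(f"p = {p:.1f}   minimizer f* = {f_star:+.3f}   "
          f"sign(f*) = {np.sign(f_star):+.0f}   Bayes sign = {bayes_sign:+.0f}")
```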

Cagdas Ozgenc