
We have the following loss function for logistic regression, the so-called log-loss, defined as:

$-\sum_i \Big[ y^i \log(h(x^i)) + (1-y^i)\log(1-h(x^i)) \Big]$
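(For concreteness, here is how I would write this loss in code; this is just my own minimal NumPy sketch, not something taken from a library.)

```python
import numpy as np

def log_loss(y, h):
    """Log-loss for labels y in {0, 1} and predicted probabilities h in (0, 1)."""
    y = np.asarray(y, dtype=float)
    h = np.asarray(h, dtype=float)
    # Negative sum of the per-sample terms, exactly as in the formula above.
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```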

We also know that logistic regression assigns a data sample to class $y=1$ if the posterior probability $h(x)$ of class $y=1$ is greater than 0.5.

Now my question: the term $y^i\log(h(x^i))$ quantifies the case where the true label is "$y=1$" but the prediction is "$y=0$". The prediction "$y=0$" is, however, only made when $h(x)<0.5$. Does this mean that the $h(x^i)$ in $y^i\log(h(x^i))$ will always be $<0.5$?


1 Answer


> We also know that logistic regression assigns a data sample to class $y=1$ if the posterior probability $p$ of class $y=1$ is greater than 0.5.

This is not true; logistic regression is not a classifier. But the notation here is a little confusing because $p$ does not appear in your expression for the loss.

> The term $y^i\log(h(x^i))$ quantifies the case where the true label is "$y=1$", but the prediction is "$y=0$".

This is not true. The way to think about the log-loss function is that $y^i$ works as a "switch." If $y^i=1$, then the term $\log(h(x^i))$ is added to the loss; if $y^i=0$, the term $\log(1-h(x^i))$ is added to the loss.

> The prediction "$y=0$" is, however, only made when $h(x)<0.5$.

When we're considering the log-loss, at no point do we consider whether or not $h(x^i)>0.5$. If $y^i=1$ but $h(x^i) < 0.5$, even if $h(x^i)$ is very small, the expression is evaluated exactly as it is written. Very small values of $h(x^i)$ naturally imply that the loss contribution of that sample will be very large, which is exactly what we want, because in that case the model predicts the sample poorly.
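To make the "switch" concrete, here is a minimal sketch (NumPy; the helper name is my own, not a library function) showing that the per-sample term is evaluated the same way whether $h(x^i)$ is above or below 0.5, and that a confident wrong prediction contributes a large loss:

```python
import numpy as np

def per_sample_loss(y_i, h_i):
    # Only one of the two terms is nonzero because y_i is either 0 or 1.
    return -(y_i * np.log(h_i) + (1 - y_i) * np.log(1 - h_i))

# True label is 1 in every case; no 0.5 threshold is applied anywhere.
print(per_sample_loss(1, 0.90))  # ~0.105 -- confident and right, small loss
print(per_sample_loss(1, 0.40))  # ~0.916 -- below 0.5, moderate loss
print(per_sample_loss(1, 0.01))  # ~4.605 -- confident and wrong, large loss
```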

Sycorax
  • Edited notation (replaced p with h(x)). I understand the "switch" notation (basically it works like an indicator function). So h(x) can take any value, but we penalize according to the true label (and if, for instance, y = 1 but h(x) is close to 1 as well, log(h(x)) will accordingly be small). Does the classification then come in a later, separate step, and could different classification rules have different optimality properties? Would e.g. classifying according to the 0.5 threshold make it a Bayes-optimal classifier? And thanks! – Pugl Aug 03 '17 at 16:38
  • More precisely, $h(x)$ must take values in $(0,1)$. If you **need** to put your results into discrete categories, the choice of a decision value depends on the costs of misclassification. Consider the decision a doctor might make -- incorrectly prescribing antibiotics (you think the patient has an infection when s/he doesn't) does less harm to a patient's quality of life than amputating a limb unnecessarily. So you have to consider the context of your decision (a sketch of turning costs into a cutoff appears after these comments). Whether or not a decision is Bayes-optimal is beyond the scope of this question/comment, but would be a fine question on its own! – Sycorax Aug 03 '17 at 16:42
  • Thank you, you helped me clear my mind. I will post the question regarding optimality too! /P – Pugl Aug 03 '17 at 16:46
  • Sycorax, I have a question regarding a concrete exercise about logistic regression and Bayes optimality - can I link you to it here? I would really appreciate your answer, as you explain things clearly and that would save me some time: https://stats.stackexchange.com/questions/296163/bayes-optimal-decision-for-logistic-regression-self-study-exercise – Pugl Aug 04 '17 at 10:21
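As a rough illustration of the cost argument in the comments (a sketch only: the cost values and the cutoff formula below are my own illustrative assumptions, not something stated in the thread), one common way to turn misclassification costs into a decision value is to predict $y=1$ when $h(x) > c_{FP}/(c_{FP}+c_{FN})$, which reduces to the familiar 0.5 rule when both error costs are equal:

```python
def cost_threshold(c_fp, c_fn):
    """Probability cutoff minimizing expected misclassification cost:
    predict y=1 whenever h(x) exceeds the returned value."""
    return c_fp / (c_fp + c_fn)

# Equal costs recover the 0.5 rule; a very costly false negative lowers the cutoff,
# so we predict y=1 even for fairly small h(x).
print(cost_threshold(1, 1))    # 0.5
print(cost_threshold(1, 10))   # ~0.091
```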