
Given a logistic regression model:

$y \in \{0, 1\}$

$ P(y=1|x;\theta) = h_{\theta}(x) = \frac{1}{1+\exp(-\theta^T x)}$

And given the value $\theta^*$ which maximises the conditional likelihood $P(y|X; \theta)$:

It seems to me that, given a new training example $x$, I should calculate the predicted value as:

$ y^*|x; \theta^* = \textbf{1} \{\frac{1}{1+\exp(-\theta^{*T} x)} > 0.5 \} $

However, a well-known online ML course (page 3) states that the prediction rule is:

$ y^*|x; \theta^* = \textbf{1} \{\theta^{*T}x > 0 \} $

These two rules don't appear to agree in, e.g., the trivial case $x \in \mathbb{R}$, $x = 0$. Which is correct?

BHC

2 Answers


They do agree. One deals with a probability of $p=0.5$. The other deals with a log-odds of $0$.

$$ \log\bigg(\dfrac{p}{1-p}\bigg)=\log(1)=0 $$
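To see this numerically, here is a minimal sketch in plain Python (function names are illustrative) checking that thresholding the probability at $0.5$ and thresholding the log-odds at $0$ produce the same label for any score $z = \theta^{*T}x$:

```python
import math

def sigmoid(z):
    """Logistic function: P(y=1 | x; theta) for the score z = theta^T x."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob_rule(z):
    """Rule 1: threshold the predicted probability at 0.5."""
    return int(sigmoid(z) > 0.5)

def predict_score_rule(z):
    """Rule 2: threshold the linear score (the log-odds) at 0."""
    return int(z > 0)

# The two rules agree for every score, including the boundary z = 0,
# where sigmoid(0) = 0.5 and both strict inequalities fail.
for z in [-3.0, -0.1, 0.0, 0.1, 3.0]:
    assert predict_prob_rule(z) == predict_score_rule(z)
```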

Importantly, though, logistic regression alone is not a classification method: there is nothing special about using a probability of $0.5$ as the cutoff threshold, and methods like logistic regression are best evaluated on their probability outputs rather than on threshold-based metrics (e.g., accuracy, sensitivity, specificity, $F_1$ score).

https://www.fharrell.com/post/class-damage/
https://www.fharrell.com/post/classification/
https://stats.stackexchange.com/a/359936/247274

Proper scoring rule when there is a decision to make (e.g. spam vs ham email)
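To make the "evaluate on probabilities" point concrete, here is a minimal sketch (plain Python, with made-up illustrative data) of two models that are indistinguishable by accuracy at the $0.5$ cutoff but clearly distinguished by log loss, a proper scoring rule:

```python
import math

def log_loss(y_true, p_pred):
    """Average negative log-likelihood -- a proper scoring rule (lower is better)."""
    eps = 1e-12  # guard against log(0)
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, p_pred)
    ) / len(y_true)

def accuracy(y_true, p_pred, cutoff=0.5):
    """Fraction of correct labels after thresholding at `cutoff`."""
    return sum(int(p > cutoff) == y for y, p in zip(y_true, p_pred)) / len(y_true)

y = [1, 1, 0, 0]
p_confident = [0.95, 0.90, 0.05, 0.10]  # confident, well-separated probabilities
p_marginal  = [0.55, 0.51, 0.45, 0.49]  # barely on the right side of 0.5

# Thresholded accuracy cannot tell the two models apart...
assert accuracy(y, p_confident) == accuracy(y, p_marginal) == 1.0
# ...but the proper scoring rule can: the confident model scores much better.
assert log_loss(y, p_confident) < log_loss(y, p_marginal)
```

The thresholded metric discards exactly the information (how far each probability is from the cutoff) that the scoring rule uses.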

Dave
  • +1 for fighting against arbitrary cutoffs – Demetri Pananos Mar 05 '21 at 15:56
  • Thanks – I can now see that they are equal. Can you elaborate on the problem of using 0.5 as the threshold? Clearly in this case that threshold implies the most likely class label. Are you arguing generally against the use of any thresholding? I agree that for model evaluation it is not suitable, but if one wanted to build a predictive model using only logistic regression, the 'end state' would have to be based on probability thresholding (at 0.5 for a binary classification problem). – BHC Mar 05 '21 at 16:03
  • @BHC I recommend reading the two links to Frank Harrell (Vanderbilt professor, founder of their biostatistics department) and the "spam vs ham" link. For the latter, I think very highly of the person who asked the question ;) The gist is that thresholding discards information, causing users to pick what Harrell has described as "bogus" models. – Dave Mar 05 '21 at 16:08
  • Metrics like accuracy, sensitivity, etc. are again quite arbitrary. Proper scoring rules are a sounder choice. – Richard Hardy Mar 06 '21 at 13:17

Of course, they agree.

Note first that the score $\theta^{*T} x \in \mathbb{R}$ is always a real number.

Your condition is:

$$ \frac{1}{1+e^{-\theta^{*T} x}} > 0.5 $$
$$ \implies \frac{1}{1+e^{-\theta^{*T} x}} > \frac{1}{2} $$
$$ \implies 2 > 1+e^{-\theta^{*T} x} $$
$$ \implies 1 > e^{-\theta^{*T} x} $$
$$ \implies e^{0} > e^{-\theta^{*T} x} $$

Since $y = e^{x}$ is strictly increasing, we can conclude from the above inequality that

$$ \implies 0 > -\theta^{*T} x $$
$$ \implies 0 < \theta^{*T} x. $$

Therefore

$$ \theta^{*T} x > 0. $$

And yes, as others have pointed out, it is not advisable to bake a threshold and the decision-making into your model, because you would be discarding information (about uncertainty, say). For example, if you get $P(y=1)=0.990$, you are not only classifying the example as 1, you are very (99%) sure that it is 1. That is not the case if you get $P(y=1)=0.510$, where the model is nearly in a dilemma. If you consider only the final classification and ignore the probability, so long as it exceeds 0.5 (or some other threshold), you lose some of the information the model gives you.
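The information loss described above can be seen directly in a short sketch (plain Python; the helper name is illustrative): hardening the two probabilities into labels makes them indistinguishable, even though they imply very different chances of error.

```python
def harden(p, cutoff=0.5):
    """Collapse a predicted probability into a hard 0/1 label."""
    return int(p > cutoff)

p_sure, p_unsure = 0.990, 0.510

# Both probabilities collapse to the same label...
assert harden(p_sure) == harden(p_unsure) == 1

# ...but predicting 1 is wrong with probability 1 - p, which differs greatly:
assert 1 - p_sure < 0.02    # about a 1% chance of error
assert 1 - p_unsure > 0.4   # nearly a coin flip
```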

  • Since classification represents an arbitrary forced choice and is inconsistent with optimum decision making, I'm not sure why we are spending so much time showing how to do it. – Frank Harrell Mar 11 '21 at 12:39