
I am having trouble understanding the loss function scikit-learn uses to fit logistic regression, which can be found here.

Specifically, I have a problem with the second term. It seems very different from the usual MLE criterion. Can someone give me a hint as to where this comes from?

$$\min_{w,c} \; \frac{1}{2} w^T w + C \sum_{i=1}^n \log\left(\exp(-y_i(X_i^T w + c)) + 1\right)$$

I think the log-likelihood of a logistic regression usually looks like the expression below. Clearly the first term of that expression is missing from the scikit-learn objective function.

$$LLH=\sum_{i=1}^n \left[{y_i}(X_i^Tw + c) - \ln\{1+\exp(X_i^Tw + c)\} \right]$$
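For concreteness, here is a small numeric sketch of the scikit-learn objective above, on made-up toy data (`X`, `w`, `c`, `y` are all illustrative, with labels assumed to be in $\{-1, +1\}$ as the formula suggests):

```python
import numpy as np

# Hypothetical toy data; X, w, c, y are made up for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # 5 samples, 3 features
w = rng.normal(size=3)
c = 0.5
y = rng.choice([-1, 1], size=5)
C = 1.0

z = X @ w + c                                  # decision values X_i^T w + c
penalty = 0.5 * (w @ w)                        # the (1/2) w^T w term
data_term = np.sum(np.log1p(np.exp(-y * z)))   # sum_i log(1 + exp(-y_i z_i))
objective = penalty + C * data_term
```

(`np.log1p` is used for `log(1 + ...)` since it is more numerically stable than `np.log(1 + ...)` for small arguments.)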

Tom Bennett

3 Answers


These two are actually (almost) equivalent because of the following property of the logistic function:

$$ \sigma(x) = \frac{1}{1+\exp(-x)} = \frac{\exp(x)}{\exp(x)+1} $$

Also

$$ \sum_{i=1}^n \log ( 1 + \exp( -y_i (X_i^T w + c) ) ) \\ = \sum_{i=1}^n \log \left[ (\exp( y_i (X_i^T w + c) ) + 1) \exp( -y_i (X_i^T w + c) ) \right] \\ = -\sum_{i=1}^n \left[ y_i (X_i^T w + c) - \log (\exp( y_i (X_i^T w + c) ) + 1) \right] $$
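The per-sample identity used above, $\log(1 + e^{-t}) = -\left(t - \log(e^t + 1)\right)$ with $t = y_i(X_i^T w + c)$, can be sanity-checked numerically at a few sample points:

```python
import numpy as np

# Numeric check of the identity log(1 + exp(-t)) = -(t - log(exp(t) + 1)).
t = np.linspace(-5, 5, 11)
lhs = np.log1p(np.exp(-t))
rhs = -(t - np.log(np.exp(t) + 1))
print(np.allclose(lhs, rhs))  # True
```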

Note, though, that your formula doesn't have $y_i$ in the "log part", while this one does. (I guess this is a typo)

Artem Sobolev

I don't think that the lack of $y_i$ is a typo:

The usual log-loss (cross-entropy loss) is: $$-\sum_i [y_i \log(p_i) + (1-y_i) \log(1 - p_i)],$$ where $p_i = \sigma(X^T_i \omega + c)$, and $\sigma(x) = 1/(1+e^{-x})$ is the logistic function.

From there, $$-\sum_i [y_i \log(p_i) + (1-y_i) \log(1 - p_i)] \\ = -\sum_i [y_i \log\left(\frac{p_i}{1-p_i}\right) + \log(1 - p_i)] \\ = -\sum_i [y_i \left( X^T_i \omega + c \right) + \log(1 - p_i)] \\ = -\sum_i [y_i \left( X^T_i \omega + c \right) - \log\left(1 + \exp({X^T_i \omega + c})\right)].$$ This matches the LLH expression given in the original post, without the $y_i$ factor in the exponential.
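The derivation can be verified numerically: with $y_i \in \{0, 1\}$ and $p_i = \sigma(z_i)$, the cross-entropy equals $-\sum_i [y_i z_i - \log(1 + e^{z_i})]$. Here $z_i$ stands in for $X_i^T \omega + c$, and the data are made up for illustration:

```python
import numpy as np

# Numeric check: cross-entropy with y in {0, 1} equals the LLH-style form.
rng = np.random.default_rng(1)
z = rng.normal(size=8)              # z_i stands in for X_i^T w + c
y = rng.integers(0, 2, size=8)      # labels in {0, 1}
p = 1.0 / (1.0 + np.exp(-z))        # p_i = sigmoid(z_i)

cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
llh_form = -np.sum(y * z - np.log1p(np.exp(z)))
print(np.isclose(cross_entropy, llh_form))  # True
```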

Adrien Bolens

It is just a matter of the definition of $y_i$. Defining $y_i$ and $\tilde y_i$ such that $y_i \in \{0, 1\}$ and $\tilde y_i \in \{-1, 1\}$ ($\tilde y_i = 2y_i -1$), and using $p_i = \sigma({X^T_i \omega + c})$ and $1- \sigma(x) = \sigma(-x)$, you get

$$-\sum_i [y_i \log(p_i) + (1-y_i) \log(1 - p_i)] = \sum_i \log\left(1 + \exp(-\tilde y_i({X^T_i \omega + c}))\right).$$
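This relabeling can also be checked numerically: converting $y_i \in \{0,1\}$ to $\tilde y_i = 2y_i - 1 \in \{-1,1\}$ makes the cross-entropy coincide with scikit-learn's $\sum_i \log(1 + e^{-\tilde y_i z_i})$ term (again with $z_i$ standing in for $X_i^T \omega + c$ on made-up data):

```python
import numpy as np

# Numeric check: with ytilde = 2y - 1 in {-1, 1}, the cross-entropy
# equals sum_i log(1 + exp(-ytilde_i * z_i)), scikit-learn's data term.
rng = np.random.default_rng(2)
z = rng.normal(size=8)              # z_i stands in for X_i^T w + c
y = rng.integers(0, 2, size=8)      # labels in {0, 1}
ytilde = 2 * y - 1                  # labels in {-1, 1}
p = 1.0 / (1.0 + np.exp(-z))        # p_i = sigmoid(z_i)

cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
sklearn_form = np.sum(np.log1p(np.exp(-ytilde * z)))
print(np.isclose(cross_entropy, sklearn_form))  # True
```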

Adrien Bolens