
Does anyone have an interpretation of a log-loss value? Am I correct to assume that values closer to 0 and 1 are more likely to indicate that the predicted value is incorrect?

Richard

1 Answer


Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a set of i.i.d. observations where $x_i$ is a $d$-dimensional vector of independent variables and $y_i$ is a binary dependent variable. A common assumption is that $y_i \sim \mathrm{Ber}(f(x_i))$ for some function $f\colon \mathbb{R}^d \to [0,1]$. To model $f$ we can use a parameterized function $h(\cdot\,; \theta)$. Taking $$h(x;\theta) = \left(1 + e^{-\theta^T x}\right)^{-1}$$ with $\theta \in \mathbb{R}^d$ yields logistic regression. Taking $$h(x;\theta = \{w, W\}) = \left(1 + e^{-w^T\sigma(Wx)}\right)^{-1}$$ with $W \in \mathbb{R}^{m \times d}$, $w \in \mathbb{R}^m$, and $\sigma$ an elementwise sigmoid function yields a standard feedforward neural network for binary classification.
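For concreteness, here is a minimal NumPy sketch of those two hypothesis classes (the function names `h_logistic` and `h_mlp` are mine, chosen for illustration):

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def h_logistic(x, theta):
    """Logistic regression: P(y = 1 | x) = sigmoid(theta^T x)."""
    return sigmoid(theta @ x)

def h_mlp(x, w, W):
    """One-hidden-layer network: sigmoid(w^T sigma(W x))."""
    return sigmoid(w @ sigmoid(W @ x))
```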

The log-loss arises when we ask how we should choose the value of $\theta$ given our available data. The maximum likelihood estimate (MLE) for $\theta$ is given by $$ \begin{align*} \theta^* &= \arg\max_{\theta}\left\{\prod_{i=1}^n P(y_i \mid h(x_i; \theta)) \right\}\\ &= \arg\max_{\theta}\left\{\prod_{i=1}^n h(x_i; \theta)^{y_i}(1 - h(x_i;\theta))^{1-y_i}\right\}\\ &= \arg\max_{\theta}\left\{\sum_{i=1}^n y_i\log(h(x_i;\theta)) + (1-y_i)\log(1-h(x_i;\theta))\right\}\\ &= \arg\min_{\theta}\left\{\sum_{i=1}^n -y_i\log(h(x_i;\theta)) - (1-y_i)\log(1-h(x_i;\theta))\right\}.\\ \end{align*} $$

You should recognize the objective in the last line as the log-loss function (also known as binary cross-entropy).
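As a sketch, that last line translates directly into NumPy (the clipping constant `eps` is my own addition, there to keep $\log 0$ out of the computation):

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    """Negative Bernoulli log-likelihood of labels y (0/1) given
    predicted probabilities p = h(x_i; theta)."""
    p = np.clip(p, eps, 1.0 - eps)  # keep log() finite at p = 0 or 1
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Libraries typically report the mean rather than the sum (e.g., `sklearn.metrics.log_loss` normalizes by $n$ by default); that only rescales the objective and does not change the minimizer.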

If by values close to 0 and 1 you mean that your model outputs these values, then I think the opposite of what you said is true: outputs close to 0 or 1 indicate less uncertainty, while outputs close to 0.5 indicate more uncertainty. Keep in mind that confidence is not correctness, though: the per-example loss is small when a confident prediction agrees with the true label, and very large when a confident prediction is wrong.
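A quick numerical check of the per-example loss $-y\log p - (1-y)\log(1-p)$ makes this concrete (`per_example_loss` is my own helper for illustration):

```python
import numpy as np

def per_example_loss(y, p):
    """Log-loss contributed by a single observation."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Suppose the true label is y = 1:
print(per_example_loss(1, 0.99))  # ~0.01  confident and correct -> tiny loss
print(per_example_loss(1, 0.50))  # ~0.69  uncertain             -> moderate loss
print(per_example_loss(1, 0.01))  # ~4.61  confident and wrong   -> large loss
```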

alto