
Suppose you have a classification problem where you are trying to predict a class label (e.g., $[0 \: 1 \: 0]^T$) with a model. One way to do this is with the log loss:

$\Large \ell_{\log} = -\sum_i[y_i\log \hat{y}_i + (1-y_i)\log (1-\hat{y}_i)]$
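
For concreteness, here is a rough NumPy sketch of this loss (the `eps` clipping is just to keep the logarithms finite; it is not part of the formula above):

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy, summed over components.

    y     : targets, each entry 0 or 1
    y_hat : predicted probabilities in (0, 1)
    eps   : small constant so log(0) never occurs
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```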

This is attractive because it does the right thing: it pushes $\hat{y}_i$ toward $\infty$ when $y_i$ is $1$, and toward $-\infty$ when $y_i$ is zero. But another way to do this is with elementwise division:

$\Large \ell_{\text{div}} = \sum_i\left[\frac{\hat{y}_i}{\max(y_i, \epsilon)} + \frac{y_i}{\max(\hat{y}_i, \epsilon)}\right]$

Here, the minimum of the function is attained when $\hat{y}$ matches $y$ on all dimensions. Isn't this a preferable cost function? Why isn't it used?

Note: $\epsilon$ is a small positive constant to prevent division by zero.
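
A quick sketch of this loss, plus a numerical check that it shrinks as $\hat{y}$ approaches $y$ (the particular $\epsilon$ and example vectors are arbitrary):

```python
import numpy as np

def div_loss(y, y_hat, eps=1e-8):
    """Elementwise-division loss: each term is smallest when y_hat_i == y_i."""
    return np.sum(y_hat / np.maximum(y, eps) + y / np.maximum(y_hat, eps))

y = np.array([0.0, 1.0, 0.0])

# The loss decreases as y_hat moves toward y, but note the huge scale:
# the y_i = 0 components are divided by eps.
for p in (0.5, 0.9, 0.99):
    y_hat = np.array([(1 - p) / 2, p, (1 - p) / 2])
    print(p, div_loss(y, y_hat))
```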

EDIT: My question isn't only about $0$ vs $1$ classification. I am also interested in situations where $y$ has real-valued entries and we want to generate a vector $\hat{y}$ that is "close" to it in some sense. For classification problems, $\log$ is the way to go; but what about real-valued vectors?

Sam
  • The prediction $\hat{y}_i$ is determined through a function of the covariates $x_i$. For example, in logistic regression, that function ensures that $\hat{y}_i \in [0,1]$ so cannot be "pushed to $-\infty$ or $\infty$". This similarly holds in other scenarios. – user257566 Jun 28 '21 at 21:09
  • Not sure how you invented this function; what makes it preferable? It's problematic because of numerical stability in that division, for one thing. // Also - the original loss function, the log loss, isn't useful for multi-class classification. – Arya McCarthy Jun 28 '21 at 23:46
  • Please don't re-post a question that was closed! It was closed for a reason. Instead, take some time to _edit_ the original question to provide the requested details/clarity. This automatically nominates it for reopening. – Arya McCarthy Jun 29 '21 at 00:02
  • Your example class label has 3 components, but the expression you've written for cross-entropy is used for binary targets. – Sycorax Jun 29 '21 at 01:12
  • @AryaMcCarthy: the log loss is indeed useful for multi-class classification, it's the log score. If you have classes $1, \dots, n$ and predicted class membership probabilities of $\hat{p}_1, \dots, \hat{p}_n$ (summing to $1$), and if the actual class of the instance turns out to be $i$, then the score is $\pm\log\hat{p}_i$ (with $\pm$ depending on whether you want a positively oriented score or not). Note how this just turns into the formula above for the 2-class case. Compare [the tag wiki](https://stats.stackexchange.com/tags/scoring-rules/info) and references therein. – Stephan Kolassa Jun 29 '21 at 05:52
  • @AryaMcCarthy this is the kind of answer I am looking for, thank you. Where can I learn more about numerical stability and why the divisive loss is unstable? – Sam Jun 29 '21 at 06:06
  • @StephanKolassa My point was better made by Sycorax: this is the binary case, which requires care to generalize it to more classes. You and I agree; that may not have been apparent from my wording. – Arya McCarthy Jun 29 '21 at 12:15
  • @Sam https://stats.stackexchange.com/questions/260505/should-i-use-a-categorical-cross-entropy-or-binary-cross-entropy-loss-for-binary/260537#260537 This thread shows how the generalization from 2 to three or more classes works. – Sycorax Jun 29 '21 at 14:10
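
To make the numerical-stability point from the comments concrete, here is a rough sketch (the particular $y$, $\hat{y}$, and $\epsilon$ are arbitrary) comparing the gradients of the two losses at an almost-perfect prediction. The division loss assigns a gradient of order $1/\epsilon$ to every component with $y_i = 0$, no matter how good the prediction already is, whereas the log-loss gradient stays of order one there:

```python
import numpy as np

EPS = 1e-8

def dlog_loss(y, y_hat):
    """Gradient of the summed binary log loss w.r.t. y_hat."""
    return -y / y_hat + (1 - y) / (1 - y_hat)

def ddiv_loss(y, y_hat):
    """Gradient of the division loss w.r.t. y_hat (away from the max kinks)."""
    return 1.0 / np.maximum(y, EPS) - y / np.maximum(y_hat, EPS) ** 2

y     = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.01, 0.98, 0.01])   # an almost-perfect prediction

print(dlog_loss(y, y_hat))   # all entries of order 1
print(ddiv_loss(y, y_hat))   # zero-label entries of order 1/EPS
```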

0 Answers