
Suppose you have a classification problem where you are trying to predict a class label (e.g., $[0 \: 1 \: 0]^T$) with a model. One way to do this is with the log loss:

$\Large \ell_{\log} = -\sum_i[y_i\log \hat{y}_i + (1-y_i)\log (1-\hat{y}_i)]$
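
For concreteness, here is a rough NumPy sketch of this loss (the `eps` clipping is just to keep the logarithms finite; it is not part of the formula above):

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy, summed over components.

    y     : targets, each entry 0 or 1
    y_hat : predicted probabilities in (0, 1)
    eps   : small constant so log(0) never occurs
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```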

This is attractive because it does the right thing: it pushes $\hat{y}_i$ toward $\infty$ when $y_i$ is $1$, and toward $-\infty$ when $y_i$ is zero. But another way to do this is with elementwise division:

$\Large \ell_{\text{div}} = \sum_i\left[\frac{\hat{y}_i}{\max(y_i, \epsilon)} + \frac{y_i}{\max(\hat{y}_i, \epsilon)}\right]$

Here, the minimum of the function is attained when $\hat{y}$ matches $y$ on all dimensions. Isn't this a preferable cost function? Why isn't it used?

Note: $\epsilon$ is a small positive constant to prevent division by zero.
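
A quick sketch of this loss, plus a numerical check that it shrinks as $\hat{y}$ approaches $y$ (the particular $\epsilon$ and example vectors are arbitrary):

```python
import numpy as np

def div_loss(y, y_hat, eps=1e-8):
    """Elementwise-division loss: each term is smallest when y_hat_i == y_i."""
    return np.sum(y_hat / np.maximum(y, eps) + y / np.maximum(y_hat, eps))

y = np.array([0.0, 1.0, 0.0])

# The loss decreases as y_hat moves toward y, but note the huge scale:
# the y_i = 0 components are divided by eps.
for p in (0.5, 0.9, 0.99):
    y_hat = np.array([(1 - p) / 2, p, (1 - p) / 2])
    print(p, div_loss(y, y_hat))
```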

EDIT: My question isn't only about $0$ vs $1$ classification. I am also interested in situations where $y$ has real-valued entries and we want to generate a vector $\hat{y}$ that is "close" to it in some sense. For classification problems, $\log$ is the way to go; but what about real-valued vectors?

Sam
  • The prediction $\hat{y}_i$ is determined through a function of the covariates $x_i$. For example, in logistic regression, that function ensures that $\hat{y}_i \in [0,1]$ so cannot be "pushed to $-\infty$ or $\infty$". This similarly holds in other scenarios. – user257566 Jun 28 '21 at 21:09
  • Not sure how you invented this function; what makes it preferable? It's problematic because of numerical stability in that division, for one thing. // Also - the original loss function, the log loss, isn't useful for multi-class classification. – Arya McCarthy Jun 28 '21 at 23:46
  • Please don't re-post a question that was closed! It was closed for a reason. Instead, take some time to _edit_ the original question to provide the requested details/clarity. This automatically nominates it for reopening. – Arya McCarthy Jun 29 '21 at 00:02
  • Your example class label has 3 components, but the expression you've written for cross-entropy is used for binary targets. – Sycorax Jun 29 '21 at 01:12
  • @AryaMcCarthy: the log loss is indeed useful for multi-class classification, it's the log score. If you have classes $1, \dots, n$ and predicted class membership probabilities of $\hat{p}_1, \dots, \hat{p}_n$ (summing to $1$), and if the actual class of the instance turns out to be $i$, then the score is $\pm\log\hat{p}_i$ (with $\pm$ depending on whether you want a positively oriented score or not). Note how this just turns into the formula above for the 2-class case. Compare [the tag wiki](https://stats.stackexchange.com/tags/scoring-rules/info) and references therein. – Stephan Kolassa Jun 29 '21 at 05:52
  • @AryaMcCarthy this is the kind of answer I am looking for, thank you. Where can I learn more about numerical stability and why the divisive loss is unstable? – Sam Jun 29 '21 at 06:06
  • @StephanKolassa My point was better made by Sycorax: this is the binary case, which requires care to generalize it to more classes. You and I agree; that may not have been apparent from my wording. – Arya McCarthy Jun 29 '21 at 12:15
  • @Sam https://stats.stackexchange.com/questions/260505/should-i-use-a-categorical-cross-entropy-or-binary-cross-entropy-loss-for-binary/260537#260537 This thread shows how the generalization from 2 to three or more classes works. – Sycorax Jun 29 '21 at 14:10
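
To make the numerical-stability point from the comments concrete, here is a rough sketch (the particular $y$, $\hat{y}$, and $\epsilon$ are arbitrary) comparing the gradients of the two losses at an almost-perfect prediction. The division loss assigns a gradient of order $1/\epsilon$ to every component with $y_i = 0$, no matter how good the prediction already is, whereas the log-loss gradient stays of order one there:

```python
import numpy as np

EPS = 1e-8

def dlog_loss(y, y_hat):
    """Gradient of the summed binary log loss w.r.t. y_hat."""
    return -y / y_hat + (1 - y) / (1 - y_hat)

def ddiv_loss(y, y_hat):
    """Gradient of the division loss w.r.t. y_hat (away from the max kinks)."""
    return 1.0 / np.maximum(y, EPS) - y / np.maximum(y_hat, EPS) ** 2

y     = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.01, 0.98, 0.01])   # an almost-perfect prediction

print(dlog_loss(y, y_hat))   # all entries of order 1
print(ddiv_loss(y, y_hat))   # zero-label entries of order 1/EPS
```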

0 Answers