Suppose you have a classification problem where you are trying to predict a one-hot class label (e.g., $[0 \: 1 \: 0]^T$) with a model. One way to do this is to use the log loss:
$\Large L_{\log} = -\sum_i[y_i\log \hat{y}_i + (1-y_i)\log (1-\hat{y}_i)]$
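For concreteness, here is a minimal NumPy sketch of how I would compute this (the function name and the clipping constant are my own, not from any library):

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Sum of the per-dimension log loss; y_hat is clipped away from 0 and 1 to avoid log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([0.0, 1.0, 0.0])                  # the one-hot target from above
print(log_loss(y, np.array([0.1, 0.8, 0.1])))  # ~0.43: prediction close to y, small loss
print(log_loss(y, np.array([0.8, 0.1, 0.1])))  # ~4.02: prediction far from y, large loss
```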
The log loss is attractive because it does the right thing: it pushes $\hat{y}_i$ toward $1$ when $y_i$ is $1$, and toward $0$ when $y_i$ is $0$ (the loss blows up toward $\infty$ when the prediction is confidently wrong). But another way to do this is with elementwise division:
$\Large L_{\text{div}} = \sum_i\left[\frac{y_i}{\max (\hat{y}_i, \epsilon)} + \frac{\hat{y}_i}{\max (y_i, \epsilon)}\right]$
Note: $\epsilon$ is a small positive constant to prevent division by zero.
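And for comparison, a sketch of the division-based loss on the same toy target (again, the names are my own; $\epsilon$ is the guard from the note above):

```python
import numpy as np

def div_loss(y, y_hat, eps=1e-12):
    """Sum over dimensions of y_i / max(y_hat_i, eps) + y_hat_i / max(y_i, eps)."""
    return np.sum(y / np.maximum(y_hat, eps) + y_hat / np.maximum(y, eps))

y = np.array([0.0, 1.0, 0.0])                  # the same one-hot target
print(div_loss(y, y))                          # 2.0: the minimum, attained when y_hat == y
print(div_loss(y, np.array([0.1, 0.8, 0.1])))  # ~2e11: dominated by y_hat_i / eps where y_i = 0
print(div_loss(y, np.array([0.8, 0.1, 0.1])))  # ~9e11: larger still for a prediction far from y
```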
Here, the minimum of the loss is attained when $\hat{y}$ matches $y$ on all dimensions. Isn't this a preferable cost function? Why isn't it used?