
The formula for cross entropy loss is this: $$-\sum_iy_i \ln\left(\hat{y}_i\right).$$

My question is, what is the minimum and maximum value for cross entropy loss, given that there is a negative sign in front of the sum?

For example: let's say the ground truth values are $y = [0,1,0]$ and the predictions are $\hat{y} = [0,0,1].$

In this case, $y_1\ln\left(\hat{y}_1\right)=0,\;y_2\ln\left(\hat{y}_2\right)=0$ and $y_3\ln\left(\hat{y}_3\right)=0.$

So, the overall loss is $-(0+0+0)=0;$ whereas, if the predicted value matches: $y = [0,1,0],\ \hat{y} = [0,1,0],$ the loss will still be $-(0+0+0) = 0$ (since $\ln(1) = 0$).

Now, what difference does it make here? Both correct and wrong predictions give a loss of zero.

Also, in many implementations of gradient descent for classification tasks, we print out the loss after a certain number of iterations. These losses usually start from a large number and decrease towards $0.$ In the case of negative log-likelihood, shouldn't they start from a large negative value and tend towards $0$? How come the initial loss starts from a positive number?

Can someone please elaborate on this? I clearly am confused and there are too many little details that are worrying me.

Bharathi A

2 Answers


It's sufficient to just consider what values $-\log(\hat{y})$ can take, for $\hat{y}$ the prediction of the correct class, because the contribution of all other summands is 0.

We know that $\log$ is monotonic increasing, so we know that $-\log$ is monotonic decreasing. We have $\hat{y} \in (0,1]$, so

  • The smallest value is $-\log(1)=0$. In other words, all loss values are non-negative.
  • There is no largest value because as $\hat{y}$ decreases, $-\log(\hat{y})$ increases without bound.

If you're not sure whether $-\log$ is increasing, decreasing or neither, you can check its derivative; $\frac{d}{d \hat{y}}\left[ -\log(\hat{y})\right]=-\frac{1}{\hat{y}}$, which is always negative for $\hat{y}\in(0,1)$. Therefore, it's monotonic decreasing.
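To make the monotonicity concrete, here is a minimal numeric sketch (my own illustration in plain Python, not part of the original answer) evaluating $-\log(\hat{y})$ for the probability assigned to the correct class:

```python
import math

# -log(y_hat) for the probability assigned to the correct class:
# as y_hat shrinks from 1 toward 0, the loss grows without bound.
for y_hat in [1.0, 0.5, 0.1, 0.01, 1e-6, 1e-12]:
    print(f"y_hat = {y_hat:<8g}  -log(y_hat) = {-math.log(y_hat):.4f}")

# y_hat = 1         -log(y_hat) = 0.0000
# y_hat = 0.5       -log(y_hat) = 0.6931
# y_hat = 0.1       -log(y_hat) = 2.3026
# y_hat = 0.01      -log(y_hat) = 4.6052
# y_hat = 1e-06     -log(y_hat) = 13.8155
# y_hat = 1e-12     -log(y_hat) = 27.6310
```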

I believe you can answer all of your questions using these observations. In particular, though, it seems that your main point of confusion stems from a false computation.

For example: let's say the ground truth values are $y = [0,1,0]$ and the predictions are $\hat{y} = [0,0,1].$

In this case, $y_1\ln\left(\hat{y}_1\right)=0,\;y_2\ln\left(\hat{y}_2\right)=0$ and $y_3\ln\left(\hat{y}_3\right)=0.$

So, the overall loss is $-(0+0+0)=0$

This is false: the loss is $0 + 1\times\left(-\log(0)\right)+0 = 1 \times \infty = \infty$.
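As a quick sanity check, here is a minimal sketch of the same computation (my own illustration; the `eps` clipping is an assumption I add so that the $\log(0)$ term produces a large finite number rather than a literal infinity, which is also how many implementations avoid overflow):

```python
import math

def cross_entropy(y, y_hat, eps=1e-12):
    """-sum_i y_i * log(y_hat_i), with predictions clipped away from 0."""
    return -sum(yi * math.log(max(yh, eps)) for yi, yh in zip(y, y_hat))

y = [0, 1, 0]

print(cross_entropy(y, [0, 0, 1]))  # wrong prediction: ~27.63 (infinite without the clipping)
print(cross_entropy(y, [0, 1, 0]))  # right prediction: 0.0 (Python prints -0.0)
```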

Sycorax
  • Perfect! Thanks for correcting me. I have a small doubt: in computing the loss, what base do we take for the log? In some places base $e$ is used, whereas in others base 10 is used. Is there any standard base to be used in this calculation? – Bharathi A Jul 28 '20 at 07:21
  • Typically, people use $\log_{e}$ for simplicity, but notice that using a different base just changes the *scale* of the entropy measurement. You can verify this by using the formula for the change of base of a logarithm. – Sycorax Jul 28 '20 at 12:56
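A quick numeric check of the change-of-base point in the comment above (a sketch of my own, not part of the thread): switching from the natural log to base 10 just multiplies every loss value by the constant $1/\ln(10)$.

```python
import math

y_hat = 0.2  # a hypothetical predicted probability for the correct class

loss_e = -math.log(y_hat)     # natural log
loss_10 = -math.log10(y_hat)  # base-10 log

# log10(x) = ln(x) / ln(10), so the two losses differ only by a constant factor.
print(loss_e, loss_10, loss_e / loss_10)  # 1.6094...  0.6989...  2.3025... (= ln(10))
```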

$$-\frac1N\sum_{n=1}^N\left[y_n\log(\hat{y}_n)+(1-y_n)\log(1-\hat{y}_n)\right]$$ If you use the above formula for the loss in your example, you get the following.

For the wrong prediction:

  • $-(1-y_1)\cdot\log(1-\hat y_1) = 0,$

  • $-y_2\cdot\log(\hat y_2) =$ a large number (a $\log(0)$ term), and

  • $-(1-y_3)\cdot\log(1-\hat y_3) =$ a large number (again a $\log(0)$ term).

For the right prediction: the loss is $0.$

So the loss is different in both cases.
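Here is a minimal sketch of this elementwise computation (my own illustration; the clipping constant is an assumption I add so that the $\log(0)$ terms stay finite):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean of -[y*log(y_hat) + (1 - y)*log(1 - y_hat)], with y_hat clipped into (eps, 1 - eps)."""
    total = 0.0
    for yn, yh in zip(y, y_hat):
        yh = min(max(yh, eps), 1 - eps)  # keep both log terms finite
        total += -(yn * math.log(yh) + (1 - yn) * math.log(1 - yh))
    return total / len(y)

y = [0, 1, 0]

print(binary_cross_entropy(y, [0, 0, 1]))  # wrong prediction: ~18.4 (two log(0)-like terms)
print(binary_cross_entropy(y, [0, 1, 0]))  # right prediction: ~0
```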

We usually apply the log loss to a sigmoid output, which lies in $(0,1),$ so we always get a finite, well-defined loss value.

The values of $\log$ on $(0,1)$ are negative, so to make the loss positive we use the negative $\log.$

manvall