The formula for cross entropy loss is: $$-\sum_i y_i \ln\left(\hat{y}_i\right).$$
My question is: what are the minimum and maximum values of the cross entropy loss, given that there is a negative sign in front of the sum?
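For concreteness, this is how I read the formula in code (a quick NumPy sketch of my own, not taken from any library; no clipping or smoothing anywhere):

```python
import numpy as np

# Direct transcription of the formula above: y and y_hat are arrays of the
# same length (y is the one-hot ground truth, y_hat the predicted probabilities).
def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))
```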
For example, suppose the ground truth is $y = [0,1,0]$ and the prediction is $\hat{y} = [0,0,1].$
In this case, $y_1\ln\left(\hat{y}_1\right)=0,\; y_2\ln\left(\hat{y}_2\right)=0$ and $y_3\ln\left(\hat{y}_3\right)=0.$
So the overall loss is $-(0+0+0)=0,$ whereas if the prediction matches the ground truth, $y = [0,1,0]$ and $\hat{y} = [0,1,0],$ the loss is still $-(0+0+0) = 0$ (since $\ln(1) = 0$).
Now, what difference does it make here? Both correct and wrong predictions give a loss of zero.
Also, in many implementations of gradient descent for classification tasks, the loss is printed every so many iterations, and it usually starts at a large positive number and decreases towards $0.$ In the case of a negative log likelihood, shouldn't it start at a large negative value and tend towards $0$? How come the initial loss is a positive number?
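To make this concrete, here is the kind of loop I have in mind (a toy sketch using PyTorch's `nn.CrossEntropyLoss`; the data, model and hyperparameters are made up purely for illustration). The printed loss starts at a positive value (roughly $\ln(3)$ for three classes) and keeps decreasing:

```python
import torch
import torch.nn as nn

# Made-up toy data: 300 samples, 4 features, 3 classes whose labels come
# from a random linear map, so a linear classifier can fit them.
torch.manual_seed(0)
X = torch.randn(300, 4)
true_w = torch.randn(4, 3)
y = (X @ true_w).argmax(dim=1)

model = nn.Linear(4, 3)                    # outputs logits for 3 classes
loss_fn = nn.CrossEntropyLoss()            # log-softmax + negative log likelihood
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(201):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(step, loss.item())           # positive and decreasing
```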
Can someone please elaborate on this? I am clearly confused, and there are too many little details worrying me.