6

So, I made a bidirectional LSTM model for sentiment classification. The model's job was to predict the rating of a movie (1-5 stars) from its review text.

While training the model I first used the categorical cross-entropy loss function. I trained the model for 10+ hours on CPU for about 45 epochs, and the reported accuracy stayed stuck at 0.5098 for every single epoch.

Then I changed the loss function to binary cross-entropy and training seemed to work fine. So I want to know: what exactly is the difference between these two?

Mohit Saini
  • Binary cross-entropy is for binary classification and categorical cross-entropy is for multi-class classification, but both work for binary classification. For categorical cross-entropy you need to convert your labels with `to_categorical`. – ᴀʀᴍᴀɴ Jul 17 '18 at 11:06
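As a concrete illustration of ᴀʀᴍᴀɴ's point, here is a minimal sketch of the `to_categorical` step, assuming `tensorflow.keras` and 1-5 star labels in a NumPy array (the variable names are hypothetical):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

ratings = np.array([1, 3, 5, 4, 2])                   # raw 1-5 star labels
one_hot = to_categorical(ratings - 1, num_classes=5)  # shift to 0-4, then one-hot
print(one_hot[0])  # [1. 0. 0. 0. 0.] for a 1-star review
```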

1 Answer

5

I would like to expand on ARMAN's comment:

Without getting into the formulas, the biggest difference is that categorical cross-entropy assumes exactly one class is correct out of all possible ones, so the target should look like [0,0,0,1,0] for a 4-star rating. Binary cross-entropy, on the other hand, scores each output unit separately, which means each example can belong to multiple classes at once: if you are predicting which items a customer will buy, they may well buy several, so a target like [0,1,0,1,0] is perfectly valid under binary_crossentropy. As ARMAN pointed out, if you only have 2 classes, a 2-output categorical_crossentropy model is equivalent to a 1-output binary_crossentropy one.
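To make that concrete, here is a hedged sketch of the two setups in `tf.keras`; the input shape and layer sizes are placeholders, not the asker's actual model:

```python
from tensorflow.keras import layers, models

# Single-label case (exactly one correct class): softmax + categorical_crossentropy
single_label = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(100,)),  # illustrative sizes
    layers.Dense(5, activation="softmax"),  # 5 mutually exclusive classes
])
single_label.compile(loss="categorical_crossentropy", optimizer="adam",
                     metrics=["accuracy"])

# Multi-label case (any subset of classes can be 1): sigmoid + binary_crossentropy
multi_label = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(100,)),
    layers.Dense(5, activation="sigmoid"),  # each output scored independently
])
multi_label.compile(loss="binary_crossentropy", optimizer="adam",
                    metrics=["accuracy"])
```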

In your specific case you should be using categorical_crossentropy, since each review has exactly one rating. Binary_crossentropy may report better-looking scores, but the outputs are not being evaluated correctly. I would also recommend trying MSE loss, since your data is ordinal (4 stars is closer to 5 than to 1).
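For the MSE suggestion, one possible setup (a sketch, again with placeholder sizes) is to predict the rating as a single continuous value:

```python
from tensorflow.keras import layers, models

regression = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(100,)),  # illustrative sizes
    layers.Dense(1),  # one linear output: the predicted star rating
])
regression.compile(loss="mse", optimizer="adam")
# At inference time, round and clip the prediction to the 1-5 range.
```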

tRosenflanz
  • @tRosenflanz: Sorry to re-open this thread by responding. I was wondering, in the case where there are multiple ones for each training sample (e.g. [0,1,1,0,0,0,0,1]), what activation function would be better: sigmoid or softmax? – j1897 Mar 04 '20 at 15:55
  • No problem. Softmax forces the class probabilities, and thus the network outputs, to add up to 1. That's clearly not true in your case, so sigmoid is the logical choice. Sigmoid does not let the output of one unit affect another, so it allows multiple units to have outputs close to 1. – tRosenflanz Mar 05 '20 at 16:09
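A quick numeric illustration of that difference, using plain NumPy (the logits are made up):

```python
import numpy as np

logits = np.array([2.0, 2.0, -1.0])

softmax = np.exp(logits) / np.exp(logits).sum()
print(softmax, softmax.sum())  # ~[0.488 0.488 0.024], sums to 1: outputs compete

sigmoid = 1.0 / (1.0 + np.exp(-logits))
print(sigmoid)                 # ~[0.881 0.881 0.269]: units are independent
```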