Softmax + CE vs Sigmoid + BCE for batched training with negative sampling, for training similarity properties

Question

This is a follow-up to this question:

Machine Learning: Should I use a categorical cross entropy or binary cross entropy loss for binary predictions?

I am training cosine-similarity properties for questions and answers, and I am wondering what the advantages and disadvantages are of using softmax + CE vs. sigmoid + BCE.

Say that in a batch I have the following questions and answers, each represented by a vector:

Questions: What's in the sky? What appears at night? What's hot?

Answers: Moon, Sun, Fire

Correct combinations: What's in the sky? Moon, Sun. What appears at night? Moon. What's hot? Fire, Sun.

In this batch, 'Moon' and 'Sun' are both answers to 'What's in the sky?', so the cosine similarity of the question vector to each of those answer vectors should be trained toward 1. 'Fire' is not an answer to that question, so its cosine similarity should be trained toward 0.
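To make the setup concrete, here is a minimal sketch of the label matrix implied by the correct combinations above, plus a plain cosine-similarity helper. The names `correct`, `labels`, and `cosine_similarity` are my own and are only for illustration; the actual question and answer vectors would come from whatever encoder is being trained.

```python
import math

questions = ["What's in the sky?", "What appears at night?", "What's hot?"]
answers = ["Moon", "Sun", "Fire"]

# Correct combinations, copied from the example batch.
correct = {
    "What's in the sky?": {"Moon", "Sun"},
    "What appears at night?": {"Moon"},
    "What's hot?": {"Fire", "Sun"},
}

# labels[i][j] is 1.0 if answers[j] is a correct answer to questions[i], else 0.0.
labels = [[1.0 if a in correct[q] else 0.0 for a in answers] for q in questions]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Each row of `labels` corresponds to one question and each column to one answer, which is the layout both loss formulations operate on.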

There are two approaches for this, one using sigmoid + BCE, another using softmax + CE.

Sigmoid + BCE:

After taking the dot products of all question-answer combinations, pass each one through a sigmoid and compute the binary cross-entropy loss against the correct label, 0 or 1.

For example:

partialLoss1 = -1 * Log ( Sigmoid( DotProduct( Vector{What's in the sky}, Vector{Moon} ) ) )

partialLoss2 = -(1 - 0) * Log ( 1 - Sigmoid( DotProduct( Vector{What's in the sky}, Vector{Fire} ) ) )

This is done for every combination of questions and answers.
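As a rough sketch of the sigmoid + BCE approach over the whole batch, the code below averages the binary cross entropy over every (question, answer) pair. The score values are made-up dot products, and `bce_loss` is a hypothetical helper name. Note that the standard BCE term for a negative pair is -log(1 - sigmoid(score)), so negatives do contribute to the loss. This naive form is not numerically stable for very large scores.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(scores, labels):
    """Mean binary cross entropy over every (question, answer) pair.

    scores[i][j]: dot product of question i and answer j, treated as a logit.
    labels[i][j]: 1.0 for a correct pair, 0.0 otherwise.
    """
    total, count = 0.0, 0
    for score_row, label_row in zip(scores, labels):
        for s, y in zip(score_row, label_row):
            p = sigmoid(s)
            # Positive pairs are penalised via log(p), negatives via log(1 - p).
            total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
            count += 1
    return total / count


# Made-up dot products: rows are the three questions, columns are Moon, Sun, Fire.
scores = [[2.0, 1.5, -1.0], [3.0, -2.0, -2.5], [-1.5, 1.0, 2.5]]
labels = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
loss = bce_loss(scores, labels)
```

Each pair is scored independently here, so a question may have any number of correct answers without the targets needing to be rescaled.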

Softmax + CE:

In this approach, a softmax is taken over each question's dot products with all of the answers. For a question that has two correct answers, the softmax output should be .5 for each of the correct answers.

If the softmax is taken over the dot products of all possible answers for 'What's in the sky?', the softmax probabilities should be .5 for 'Moon', .5 for 'Sun', and 0 for 'Fire'.

The loss between the output labels and the softmax probabilities can then be calculated with regular cross entropy.

For example:

partialLoss1 = -.5*Log( SoftmaxProbability{ Vector{What's in the sky}, Vector{Moon} } )

partialLoss2 = -.5*Log( SoftmaxProbability{ Vector{What's in the sky}, Vector{Sun} } )

partialLoss3 = -0*Log( SoftmaxProbability{ Vector{What's in the sky}, Vector{Fire} } )
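A minimal sketch of the softmax + CE approach with soft targets might look like the following. The helper name `soft_target_ce` is hypothetical, and the code assumes every question in the batch has at least one correct answer, so that the 0/1 labels can be normalised into targets such as [.5, .5, 0].

```python
import math


def softmax(row):
    """Softmax over one question's scores, shifted by the max for stability."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    z = sum(exps)
    return [e / z for e in exps]


def soft_target_ce(scores, labels):
    """Mean cross entropy between per-question softmax outputs and soft targets.

    Assumes each row of labels contains at least one 1.0.
    """
    total = 0.0
    for score_row, label_row in zip(scores, labels):
        n_pos = sum(label_row)
        targets = [y / n_pos for y in label_row]  # e.g. [0.5, 0.5, 0.0]
        probs = softmax(score_row)
        # Zero-target terms contribute nothing, so they are skipped.
        total += -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0.0)
    return total / len(scores)


# Made-up dot products: rows are the three questions, columns are Moon, Sun, Fire.
scores = [[2.0, 1.5, -1.0], [3.0, -2.0, -2.5], [-1.5, 1.0, 2.5]]
labels = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
loss = soft_target_ce(scores, labels)
```

With a target of [.5, .5, 0], the per-question loss cannot go below log(2), the entropy of the target, even when the softmax output matches the target exactly.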

Are both of these approaches correct? If so, what are the advantages and disadvantages of each one?
