2

For a siamese network, contrastive loss is typically used. $$ L = y \cdot d(x_0, x_1) + (1-y)\max(0, m - d(x_0, x_1)) $$ That is, it tries to reduce the distance for positive pairs and to push negative pairs apart, up to the margin $m$.

Is there any advantage over, say, using log likelihood?

$$ L = y \, \sigma(d(x_0, x_1)) + (1-y)\left(1 - \sigma(d(x_0, x_1))\right) $$
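
For concreteness, here is a minimal NumPy sketch of both losses, assuming Euclidean distance and a margin of $m = 1$ (the function names and example vectors are purely illustrative):

```python
import numpy as np

def euclidean(x0, x1):
    # Euclidean distance between two embedding vectors
    return np.linalg.norm(np.asarray(x0) - np.asarray(x1))

def contrastive_loss(x0, x1, y, m=1.0):
    # Contrastive loss: y = 1 for a similar pair, y = 0 for a dissimilar pair
    d = euclidean(x0, x1)
    return y * d + (1 - y) * max(0.0, m - d)

def sigmoid_of_distance_loss(x0, x1, y):
    # The sigmoid-based alternative from the question
    d = euclidean(x0, x1)
    s = 1.0 / (1.0 + np.exp(-d))
    return y * s + (1 - y) * (1 - s)

a, b = np.array([0.2, 0.4]), np.array([1.0, 1.2])   # d is roughly 1.13
print(contrastive_loss(a, b, y=1))          # ~1.13: pulls a similar pair together
print(contrastive_loss(a, b, y=0))          # 0.0: pair is already beyond the margin
print(sigmoid_of_distance_loss(a, b, y=1))  # ~0.76
```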

Sycorax
DSKim
  • You can use math typesetting in your posts. More information: https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference – Sycorax Aug 29 '20 at 17:11
  • Hello Kim, why don't we update the model when we find that a pair is dissimilar? How will the network learn if we don't update it in that case? – Avv Dec 29 '21 at 04:18

1 Answer

2

The main point of using a strategy like siamese loss or triplet loss is that you don't have to know all of your classes at training time -- you just want to use the classes that you do have at training time to create a representation that might succeed in the presence of new classes in the future. The way that these losses find a good representation is to compare distances: minimize distance between like classes, and separate unlike classes by some margin $m$.
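
For example, one way to use such a representation is a nearest-reference lookup in embedding space, which lets you add classes after training. The following is a minimal sketch with made-up embeddings, where `embed` stands in for the trained encoder:

```python
import numpy as np

def embed(x):
    # Stand-in for the trained siamese encoder; in practice this would be
    # the network that maps an input to its embedding vector.
    return np.asarray(x, dtype=float)

# One reference embedding per class. "cat" was never seen when the encoder
# was trained -- it is added afterwards, which a fixed softmax head cannot do.
references = {
    "dog": embed([0.9, 0.1]),
    "car": embed([0.1, 0.9]),
    "cat": embed([0.8, 0.3]),
}

def classify(x):
    # Assign a query to the class of its nearest reference embedding.
    z = embed(x)
    return min(references, key=lambda k: np.linalg.norm(z - references[k]))

print(classify([0.85, 0.25]))  # -> "cat", with no retraining of the encoder
```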

Classification losses require you to know ahead of time exactly what all of your classes are, because a classifier assigns every point in the space of inputs to a class. You can't add a new class later, because all of the input space has already been assigned to a class.

A secondary point is that distances aren't the same as probabilities, and the difference matters here. Probabilities are non-negative, and probabilities of mutually exclusive events sum to 1. Distances are also non-negative, but are not bounded above by 1. Because the smallest distance is 0, a sigmoid of a distance is always 0.5 or larger: $\sigma(0)=\frac{1}{1 + \exp(-0)}=\frac{1}{2}$. This isn't particularly important from an optimization perspective, but it does mean that the usual intuition that the loss should be able to reach 0 is not satisfied.
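
A small numeric sketch of the positive-pair term $y\,\sigma(d)$ from the question makes this concrete:

```python
import numpy as np

d = np.array([0.0, 0.5, 2.0, 10.0])    # distances are non-negative
sigma = 1.0 / (1.0 + np.exp(-d))
print(sigma)                            # [0.5  0.622  0.881  1.000] (approx.)
# sigma(d) >= 0.5 for every d >= 0, so the positive-pair term y * sigma(d)
# can never be driven below 0.5, even for a perfect match with d = 0.
```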

Sycorax