
A triplet embedding maps images into an embedding space such that images deemed more similar to each other end up closer together. The "triplet" comes from the training setup: each training example is a triple $(A, P, N)$, where $A$ is the anchor image, $P$ is a positive example (an image deemed similar to $A$), and $N$ is a negative example (one deemed dissimilar).

The architecture is as follows: each element of the triplet is passed through a convolutional neural network, followed by an embedding network, with weights shared across the three inputs. We denote this mapping by $f$, so that for an image $x$, $f(x)$ is its embedding.
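For concreteness, here is a minimal numpy sketch of the shared-weights idea. The "network" is just a fixed linear map standing in for the conv net plus embedding net; the point is only that the same $f$ is applied to all three inputs (the L2 normalization follows FaceNet's convention, which is an assumption here, not required in general):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the conv net + embedding net: a fixed linear map from
# 8-dim "images" to 4-dim embeddings. Purely illustrative.
W = rng.normal(size=(8, 4))

def f(x):
    """Shared embedding: the same weights are applied to A, P, and N."""
    z = x @ W
    return z / np.linalg.norm(z)  # L2-normalize (as in FaceNet)

anchor, positive, negative = rng.normal(size=(3, 8))
emb_a, emb_p, emb_n = f(anchor), f(positive), f(negative)
```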

My question concerns the choice of loss function. As an example, here are two different approaches:

FaceNet: A Unified Embedding for Face Recognition and Clustering

Deep metric learning using Triplet network

In the first approach, separation of positive and negative examples is enforced through a margin $\alpha$:

$$\left[\left\|f(x_i^A)-f(x_i^P)\right\|_2^2-\left\|f(x_i^A)-f(x_i^N)\right\|_2^2+\alpha\right]_+.$$
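A small numpy sketch of this hinge loss (my own illustration, not FaceNet's implementation; the margin value $\alpha = 0.2$ is FaceNet's default):

```python
import numpy as np

def triplet_margin_loss(fa, fp, fn, alpha=0.2):
    """Hinge on squared distances: [||fa-fp||^2 - ||fa-fn||^2 + alpha]_+."""
    d_pos = np.sum((fa - fp) ** 2)
    d_neg = np.sum((fa - fn) ** 2)
    return max(d_pos - d_neg + alpha, 0.0)

# The loss is exactly zero once the negative is at least alpha farther
# (in squared distance) from the anchor than the positive:
fa = np.array([0.0, 0.0])
fp = np.array([0.1, 0.0])   # close to the anchor
fn = np.array([1.0, 0.0])   # far from the anchor
print(triplet_margin_loss(fa, fp, fn))  # 0.0: margin satisfied
```

Note the hinge: once a triplet satisfies the margin it contributes no gradient, which is why FaceNet puts so much emphasis on mining hard triplets.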

The second approach is to use a softmax loss:

$$\frac{\exp(\|f(x_i^a)-f(x_i^p)\|_2)}{\exp(\|f(x_i^a)-f(x_i^p)\|_2)+\exp(\|f(x_i^a)-f(x_i^n)\|_2)},$$

which has the property that as the loss goes to 0, $\frac{\|f(x_i^a)-f(x_i^p)\|_2}{\|f(x_i^a)-f(x_i^n)\|_2}\rightarrow 0$; that is, positive examples end up embedded much closer to the anchor than negative examples, which is exactly the goal.
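A numpy sketch of this softmax-of-distances quantity (again my own illustration, not the paper's code; the max-shift is a standard numerical-stability trick I've added, and the example embeddings are made up):

```python
import numpy as np

def softmax_of_distances(fa, fp, fn):
    """exp(d+) / (exp(d+) + exp(d-)), with d+/d- the Euclidean distances
    from the anchor to the positive/negative embedding."""
    d_pos = np.linalg.norm(fa - fp)
    d_neg = np.linalg.norm(fa - fn)
    m = max(d_pos, d_neg)  # shift exponents for numerical stability
    e_pos, e_neg = np.exp(d_pos - m), np.exp(d_neg - m)
    return e_pos / (e_pos + e_neg)

fa = np.zeros(2)
fp = np.array([0.1, 0.0])
# As the negative moves farther from the anchor, the loss shrinks toward 0:
for r in (0.5, 2.0, 8.0):
    fn = np.array([r, 0.0])
    print(softmax_of_distances(fa, fp, fn))
```

Unlike the hinge loss above, this quantity is smooth and never exactly zero, so every triplet keeps contributing (a small) gradient.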

I'm sure there are other ways, and I'm curious if there's a nice review of which methods work better?

Alex R.

0 Answers