
In Siamese networks, the aim is to pull data from the same class closer together and push data from different classes farther apart.

Suppose we want a face identification system for 5 people (P1, P2, ..., P5) with 3 face images each, and we want to train a Siamese network for it. While training, we first shuffle the 5x3 images (Px_y: image y of participant x). My question is:

A) Do we give all combinations of pairs such as

Px_y and Px'_y' for each combination of x and x' = {1,2,3,4,5} and y and y' = {1,2,3} 

B) Or do we train a separate Siamese network for each person, such that

To train the network for P1, give:

P1_y and Px'_y' for each combination of x' = {1,2,3,4,5} and y and y' = {1,2,3}

To train the network for P2, give:

P2_y and Px'_y' for each combination of x' = {1,2,3,4,5} and y and y' = {1,2,3}

Which one is meant when people say "train a Siamese network"? For C classes, can a single network compare any two instances and say whether they come from the same class, or do we need to train a model for each participant, as in one-vs-all? Why do we use Siamese networks: are they aware of class information, or do they only say "same" or "not same"? If the latter, I suppose that for face identification we have to train one Siamese network per person.

Mas A

1 Answer


Note that a Siamese network just learns embeddings (or encodings), which are low-dimensional representations of the input image. To identify the faces of five people, you will have to do the following:

  1. Train a Siamese network to learn embeddings (or encodings)
  2. Take embeddings from Step 1 and train a separate classifier for 5-way classification.

Let's review the methodology in detail. Given your problem of face classification, it is best to train a Siamese network with triplet loss, as discussed in the FaceNet paper by Schroff et al., 2015.

Step 1a: Data Preparation

For the Siamese network you need to provide a set of three images for each observation:

  • Anchor (A)
  • Positive (P)
  • Negative (N)

The positive image is an image of the same class as the anchor image, whereas the negative image is of a different class from the anchor image. For example,

[Figure: example of an observation for triplet loss training]
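As a rough illustration of this data preparation, here is a minimal sketch that builds (anchor, positive, negative) triplets for the 5 people x 3 images setup from the question. The function name `make_triplets`, the `Px_y` string ids standing in for real images, and the choice of one random negative per anchor-positive pair are all assumptions made for the example; FaceNet itself selects triplets more carefully.

```python
import itertools
import random


def make_triplets(labels_to_images, rng=None):
    """Build (anchor, positive, negative) triplets from a dict of class -> image ids.

    Hypothetical helper for illustration: one random negative is drawn
    for every ordered (anchor, positive) pair of the same class.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    triplets = []
    for person, images in labels_to_images.items():
        # Images of every other person are candidate negatives.
        negatives = [
            img
            for other, imgs in labels_to_images.items()
            if other != person
            for img in imgs
        ]
        # Every ordered pair of distinct images of the same person
        # gives one (anchor, positive) combination.
        for anchor, positive in itertools.permutations(images, 2):
            triplets.append((anchor, positive, rng.choice(negatives)))
    return triplets


# The setup from the question: 5 people, 3 images each, named Px_y.
dataset = {f"P{x}": [f"P{x}_{y}" for y in range(1, 4)] for x in range(1, 6)}
triplets = make_triplets(dataset)
```

Note that a single network sees triplets drawn from all five people, which corresponds to option A in the question rather than one network per person.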

Step 1b: Training using Triplet loss function

From Schroff et al., 2015, the triplet loss function is defined as:

$$ \text{TripletLoss}(A, P, N) = \max\left(\lVert f(A) - f(P) \rVert^2 - \lVert f(A) - f(N) \rVert^2 + \text{margin},\ 0\right) $$

You will have to choose an appropriate value for the margin.
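To make the formula concrete, here is a small dependency-free sketch of the loss for a single triplet. The embeddings are plain Python lists standing in for the network outputs f(A), f(P), f(N), and the default `margin=0.2` is only an illustrative value, not a recommendation for your data.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """max(|f(A) - f(P)|^2 - |f(A) - f(N)|^2 + margin, 0) for one triplet."""
    return max(
        squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + margin,
        0.0,
    )


# Toy 2-D embeddings (made up for the example).
f_a = [0.0, 0.0]
f_p = [0.1, 0.0]  # close to the anchor
f_n = [1.0, 0.0]  # far from the anchor

easy = triplet_loss(f_a, f_p, f_n)  # negative is already far enough away
hard = triplet_loss(f_a, f_n, f_p)  # roles swapped: the "positive" is farther
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate this contribute a gradient during training.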

Step 2: Using embeddings to train a classifier

Now you need to use the embeddings to train a separate classifier. Pass each image through the network you trained in Step 1 and extract the output of the embedding layer; this is a low-dimensional representation of the input image. You can then use an SVM classifier to classify the embeddings into one of the five classes.
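Here is a minimal sketch of this second stage. To keep it dependency-free it uses a nearest-centroid classifier as a stand-in for the SVM mentioned above (this is a swap, not what the answer prescribes), and the 2-D embeddings are made-up values rather than real network outputs. The names `fit_centroids` and `predict` are hypothetical.

```python
def fit_centroids(embeddings, labels):
    """Average the embeddings of each class (stand-in for fitting an SVM)."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest to the given embedding."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


# Toy embeddings for two people (made up; real ones come from Step 1).
train_embeddings = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
train_labels = ["P1", "P1", "P2", "P2"]

centroids = fit_centroids(train_embeddings, train_labels)
prediction = predict(centroids, [0.95, 0.8])
```

With a real SVM the structure is the same: fit on (embedding, person) pairs, then predict on the embedding of a new image. The embedding network itself never needs to know the class labels at prediction time.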



kedarps
  • Are you not describing triplet loss rather than a Siamese network? E.g. a nice tutorial is also given here with a lot of explanation: https://docs.fast.ai/tutorial.siamese.html – Björn Jan 14 '22 at 14:57
  • What if I want to use it to decide whether an image is person 1 and not person 2, 3 or 4, i.e. binary classification? How would that work? Would I train p1-p1 as same and p1-p2, p1-p3, p1-p4 as different? Then, to classify p1 vs. not p1, would I use one branch of the trained Siamese network as a feature extractor, as in transfer learning? – Mas A Jan 16 '22 at 22:28
  • @MasA: you can still train using the procedure I described, but after extracting the embeddings just do a one-vs-all classifier for each person. Remember that if you want to distinguish between all faces, you have to train the Triplet loss model with data from all faces. – kedarps Jan 18 '22 at 13:42