
I am trying to implement a Siamese net for binary classification of audio based on a paper. Below is a summary of the information the authors provided about the model architecture.

This model generates embeddings using a Siamese network (SNN) consisting of convolutional layers. The network was trained using a contrastive loss. Log-Mel spectrograms are used as input. Our SNN consists of two CNNs to extract embeddings, one from each of the two inputs. Specifically, each CNN has 4 convolutional layers, each followed by a ReLU activation. The encoded embeddings are then concatenated and fed into a 2-layer fully connected network to estimate their similarity. The final layer uses a sigmoid activation function to squash the output value between 0 and 1, which is regarded as the similarity value.
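To make my reading of the description concrete, here is a minimal PyTorch sketch of how I understand the architecture. The channel counts, kernel sizes, pooling, and embedding dimension are my own assumptions, since the paper does not specify them:

```python
import torch
import torch.nn as nn

class EmbeddingCNN(nn.Module):
    """Assumed embedding branch: 4 conv layers, each followed by ReLU.
    Channel sizes and the final pooling are guesses, not from the paper."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse time/frequency axes
        )
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class SiameseNet(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        # one encoder, applied to both inputs (shared weights)
        self.encoder = EmbeddingCNN(emb_dim)
        # 2-layer FC head on the concatenated embeddings, sigmoid output
        self.head = nn.Sequential(
            nn.Linear(2 * emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x1, x2):
        e1, e2 = self.encoder(x1), self.encoder(x2)
        sim = self.head(torch.cat([e1, e2], dim=1))
        return e1, e2, sim  # embeddings (for the loss) + similarity in [0, 1]
```

Here the forward pass returns both the embeddings and the similarity score, since the contrastive loss needs the former.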

What I don't understand is how the full model (namely the fully-connected part of it) can be trained using the contrastive loss, which takes the two extracted embeddings as input. Should I treat the CNN and the fully-connected part of the model as separate networks, i.e. train them separately? Further, what is the standard way to evaluate SNNs? I am thinking of either comparing an unlabelled sample against a labelled one and thresholding the similarity, or averaging the embeddings of each class and picking the class whose centroid my sample is closest to.
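For reference, the contrastive loss I mean is the standard pairwise formulation (as in Hadsell et al., which I assume the paper uses), which operates only on the two embeddings and a pair label:

```python
import numpy as np

def contrastive_loss(e1, e2, y, margin=1.0):
    """Pairwise contrastive loss (assumed Hadsell et al. formulation).
    e1, e2 : (N, D) arrays of embeddings for each side of the pair
    y      : (N,) array, 1 for similar pairs, 0 for dissimilar pairs
    """
    d = np.linalg.norm(e1 - e2, axis=1)                 # Euclidean distance
    pos = y * d ** 2                                    # pull similar pairs together
    neg = (1 - y) * np.maximum(0.0, margin - d) ** 2    # push dissimilar pairs apart
    return float(np.mean(pos + neg))
```

As the sketch shows, the sigmoid similarity output never appears in this loss, which is exactly what confuses me about how the fully-connected head gets trained.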

Any advice would be highly appreciated.
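The two evaluation strategies I have in mind could be sketched like this (hypothetical helper names, NumPy for brevity):

```python
import numpy as np

def threshold_classify(sim, threshold=0.5):
    """Strategy 1: compare an unlabelled sample with a known one and
    declare them the same class if the model's similarity clears a threshold.
    The threshold value is an assumption to be tuned on validation data."""
    return sim >= threshold

def centroid_classify(emb, class_embs):
    """Strategy 2: average the embeddings of each class into a centroid,
    then assign the sample to the nearest centroid.
    class_embs : dict mapping class label -> (N, D) array of embeddings"""
    centroids = {c: np.mean(v, axis=0) for c, v in class_embs.items()}
    return min(centroids, key=lambda c: np.linalg.norm(emb - centroids[c]))
```

I would appreciate knowing whether either of these is standard practice, or whether there is a better-established protocol.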

Anna_H

0 Answers