
I am trying to implement a Siamese net for binary classification of audio based on a paper. Below is a summary of the information the authors provided about the model architecture.

This model generates embeddings using a Siamese network (SNN) consisting of convolutional layers. The network was trained using a contrastive loss. Log-Mel spectrograms are used as input. Our SNN consists of two CNNs to extract embeddings, one from each of the two inputs. Specifically, each CNN has 4 convolutional layers, each followed by a ReLU activation. The encoded embeddings are then concatenated and fed into a 2-layer fully connected network to estimate their similarity. The final layer uses a sigmoid activation function to squash the output value between 0 and 1, which is regarded as the similarity value.
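To make my reading of the description concrete, here is a minimal PyTorch sketch of how I understand the architecture. The channel counts, kernel sizes, pooling, and embedding dimension are my own assumptions, since the paper does not specify them:

```python
import torch
import torch.nn as nn

class EmbeddingCNN(nn.Module):
    """Assumed embedding branch: 4 conv layers, each followed by ReLU.
    Channel sizes and the final pooling are guesses, not from the paper."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse time/frequency axes
        )
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class SiameseNet(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        # one encoder, applied to both inputs (shared weights)
        self.encoder = EmbeddingCNN(emb_dim)
        # 2-layer FC head on the concatenated embeddings, sigmoid output
        self.head = nn.Sequential(
            nn.Linear(2 * emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x1, x2):
        e1, e2 = self.encoder(x1), self.encoder(x2)
        sim = self.head(torch.cat([e1, e2], dim=1))
        return e1, e2, sim  # embeddings (for the loss) + similarity in [0, 1]
```

Here the forward pass returns both the embeddings and the similarity score, since the contrastive loss needs the former.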

What I don't understand is how the full model (namely the fully-connected part of it) can be trained using the contrastive loss, which takes the two extracted embeddings as input. Should I treat the CNN and the fully-connected part of the model as separate networks, i.e. train them separately? Further, what is the standard way to evaluate SNNs? I am thinking of either comparing an unlabelled sample against a labelled one and thresholding the similarity, or averaging the embeddings of each class and picking the class whose centroid my sample is closest to.
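For reference, the contrastive loss I mean is the standard pairwise formulation (as in Hadsell et al., which I assume the paper uses), which operates only on the two embeddings and a pair label:

```python
import numpy as np

def contrastive_loss(e1, e2, y, margin=1.0):
    """Pairwise contrastive loss (assumed Hadsell et al. formulation).
    e1, e2 : (N, D) arrays of embeddings for each side of the pair
    y      : (N,) array, 1 for similar pairs, 0 for dissimilar pairs
    """
    d = np.linalg.norm(e1 - e2, axis=1)                 # Euclidean distance
    pos = y * d ** 2                                    # pull similar pairs together
    neg = (1 - y) * np.maximum(0.0, margin - d) ** 2    # push dissimilar pairs apart
    return float(np.mean(pos + neg))
```

As the sketch shows, the sigmoid similarity output never appears in this loss, which is exactly what confuses me about how the fully-connected head gets trained.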

Any advice would be highly appreciated.
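The two evaluation strategies I have in mind could be sketched like this (hypothetical helper names, NumPy for brevity):

```python
import numpy as np

def threshold_classify(sim, threshold=0.5):
    """Strategy 1: compare an unlabelled sample with a known one and
    declare them the same class if the model's similarity clears a threshold.
    The threshold value is an assumption to be tuned on validation data."""
    return sim >= threshold

def centroid_classify(emb, class_embs):
    """Strategy 2: average the embeddings of each class into a centroid,
    then assign the sample to the nearest centroid.
    class_embs : dict mapping class label -> (N, D) array of embeddings"""
    centroids = {c: np.mean(v, axis=0) for c, v in class_embs.items()}
    return min(centroids, key=lambda c: np.linalg.norm(emb - centroids[c]))
```

I would appreciate knowing whether either of these is standard practice, or whether there is a better-established protocol.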

Anna_H

0 Answers