
I would like to compute the similarity between audio signals of different lengths.

One way of doing this is to train an RNN (LSTM/GRU) autoencoder and extract the hidden-layer representation, i.e. a feature vector of the same dimension for each audio signal. Each signal is first transformed into a spectrogram, and the frequency vector at each time step is fed to the model. With the resulting feature vectors, a generic distance measure (e.g. cosine similarity) can then be applied to quantify how similar two signals are. A sketch of this pipeline is shown below.
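For concreteness, here is a minimal sketch of that pipeline, assuming PyTorch. The layer sizes, the choice of the encoder's final hidden state as the embedding, and the repeat-the-embedding decoding scheme are illustrative assumptions, not a prescribed design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_freq_bins=128, embed_dim=64):
        super().__init__()
        self.encoder = nn.LSTM(n_freq_bins, embed_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, n_freq_bins, batch_first=True)

    def embed(self, spectrogram):
        # spectrogram: (batch, time, n_freq_bins); time may vary per signal
        _, (h_n, _) = self.encoder(spectrogram)
        return h_n[-1]                      # (batch, embed_dim) fixed-size vector

    def forward(self, spectrogram):
        z = self.embed(spectrogram)
        # Repeat the embedding at every time step so the decoder can
        # reconstruct the full sequence (one common seq2seq-AE scheme).
        z_seq = z.unsqueeze(1).expand(-1, spectrogram.size(1), -1)
        recon, _ = self.decoder(z_seq)
        return recon                        # same shape as the input

model = LSTMAutoencoder()

# Two spectrograms of different lengths (batch of 1 each), random here
# as stand-ins for real STFT magnitudes.
spec_a = torch.randn(1, 200, 128)
spec_b = torch.randn(1, 350, 128)

# After training with a reconstruction loss (e.g. MSE between forward()
# output and input), the fixed-size embeddings can be compared directly:
sim = F.cosine_similarity(model.embed(spec_a), model.embed(spec_b))
print(sim.item())   # in [-1, 1]; higher means more similar under this model
```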

However, I've seen comments that an RNN autoencoder might not necessarily map two similar time-series sequences to nearby positions in the feature space.

Can anyone verify the above statement?

1 Answer

Yes, the whole basis of AE training is that the network tries to learn a mapping that places similar inputs at nearby positions in a lower-dimensional space.

If the AE has been trained properly (which you cannot be sure of), then similar inputs will very likely receive nearby mappings. But note that the training objective is reconstruction, not distance preservation, so this is by no means guaranteed.
