I would like to compute the similarity between audio signals of different lengths.
One way of doing this is to train an RNN (LSTM/GRU) autoencoder and extract the hidden-layer representation, i.e. a feature vector of the same dimension for each audio clip. Each audio signal is first transformed into a spectrogram, and the frequency vector at each time step is fed to the model. With the resulting feature vectors, a generic distance measure (e.g. cosine similarity) can then be used to compare the audio clips.
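To make the setup concrete, here is a minimal sketch in PyTorch of what I have in mind (the layer sizes, the choice of the final hidden state as the embedding, and the repeat-the-code decoder are my own assumptions, not a reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMAutoencoder(nn.Module):
    """Hypothetical sketch: maps a variable-length spectrogram
    (batch, time, freq_bins) to a fixed-size embedding."""

    def __init__(self, n_freq_bins=64, embed_dim=32):
        super().__init__()
        self.encoder = nn.LSTM(n_freq_bins, embed_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, n_freq_bins, batch_first=True)

    def encode(self, spec):
        # Use the encoder's final hidden state as the fixed-size embedding
        _, (h, _) = self.encoder(spec)
        return h[-1]                                  # (batch, embed_dim)

    def forward(self, spec):
        z = self.encode(spec)
        # Repeat the code at every time step, then decode back to a spectrogram
        z_seq = z.unsqueeze(1).expand(-1, spec.size(1), -1)
        recon, _ = self.decoder(z_seq)
        return recon                                  # train with MSE vs. spec

model = LSTMAutoencoder()
spec_a = torch.randn(1, 100, 64)   # 100 spectrogram frames
spec_b = torch.randn(1, 150, 64)   # 150 frames: different length, same embedding size
emb_a, emb_b = model.encode(spec_a), model.encode(spec_b)
sim = F.cosine_similarity(emb_a, emb_b).item()
print(emb_a.shape, sim)
```

The point is that both clips, despite their different lengths, end up as 32-dimensional vectors that can be compared with cosine similarity.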
However, I've seen comments that an RNN autoencoder might not necessarily map two similar time-series sequences to nearby positions in the feature space.
Can anyone verify the above statement?