I would like to compute the similarity between audio signals of different lengths.
One way of doing this is to train an RNN (LSTM/GRU) autoencoder and extract the hidden-layer representation, i.e. a feature vector of the same dimension for each audio clip. Each audio signal is first transformed into a spectrogram, and the frequency vector at each time step is fed to the model. With the resulting feature vectors, a generic distance measure (e.g. cosine similarity) can then be used to compare the audio clips.
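To make the setup concrete, here is a minimal sketch in PyTorch of what I have in mind (the layer sizes, the choice of the final hidden state as the embedding, and the repeat-the-code decoder are my own assumptions, not a reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMAutoencoder(nn.Module):
    """Hypothetical sketch: maps a variable-length spectrogram
    (batch, time, freq_bins) to a fixed-size embedding."""

    def __init__(self, n_freq_bins=64, embed_dim=32):
        super().__init__()
        self.encoder = nn.LSTM(n_freq_bins, embed_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, n_freq_bins, batch_first=True)

    def encode(self, spec):
        # Use the encoder's final hidden state as the fixed-size embedding
        _, (h, _) = self.encoder(spec)
        return h[-1]                                  # (batch, embed_dim)

    def forward(self, spec):
        z = self.encode(spec)
        # Repeat the code at every time step, then decode back to a spectrogram
        z_seq = z.unsqueeze(1).expand(-1, spec.size(1), -1)
        recon, _ = self.decoder(z_seq)
        return recon                                  # train with MSE vs. spec

model = LSTMAutoencoder()
spec_a = torch.randn(1, 100, 64)   # 100 spectrogram frames
spec_b = torch.randn(1, 150, 64)   # 150 frames: different length, same embedding size
emb_a, emb_b = model.encode(spec_a), model.encode(spec_b)
sim = F.cosine_similarity(emb_a, emb_b).item()
print(emb_a.shape, sim)
```

The point is that both clips, despite their different lengths, end up as 32-dimensional vectors that can be compared with cosine similarity.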
However, I've seen comments that an RNN autoencoder might not necessarily map two similar time-series sequences to nearby positions in the feature space.
Can anyone verify the above statement?