Why is almost all speech processing, whether generative or for recognition, so heavily based on Mel spectrograms?
In a conversation with a signal processing expert, I was asked why most ML systems in the speech processing domain work with Mel spectrograms instead of other spectrograms or audio representations that may be invertible, thus removing the need for components like neural vocoders.
I have tried using FFT-based spectrograms in the past without success, and my assumption is that such high-dimensional data leaves too much room for noise. However, I was informed that lower-dimensional invertible spectrograms exist, and the question was why nobody is using those.
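For concreteness, here is a rough sketch of the dimensionality and invertibility difference I have in mind (using librosa; the 16 kHz sample rate, test tone, FFT size, and hop length are just illustrative values I picked, not anything from an actual system):

```python
import librosa
import numpy as np

sr = 16000
# Stand-in signal; in practice this would be a speech recording.
y = librosa.tone(440, sr=sr, duration=2.0)

n_fft, hop = 1024, 256

# Linear (FFT-based) magnitude spectrogram: n_fft // 2 + 1 = 513 bins per frame.
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
print(S.shape)  # (513, n_frames)

# Phase is discarded above, but Griffin-Lim can estimate it from the linear
# magnitudes alone, so no neural vocoder is strictly required to get audio back.
y_lin = librosa.griffinlim(S, n_iter=32, hop_length=hop)

# 80-band Mel spectrogram: the mel filterbank projects 513 bins down to 80,
# a lossy, non-square linear map on top of the already-missing phase.
M = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                   hop_length=hop, n_mels=80)
print(M.shape)  # (80, n_frames)

# Going back requires an approximate pseudo-inverse of the filterbank plus
# Griffin-Lim; the quality drop here is what neural vocoders are meant to fix.
y_mel = librosa.feature.inverse.mel_to_audio(M, sr=sr, n_fft=n_fft, hop_length=hop)
```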
We looked into speech processing with alternative audio representations for a bit and did not find anything. Additionally, looking at the original Tacotron paper only turned up the following paragraph explaining the choice of Mel spectrograms:
Because of this redundancy, we use a different target for seq2seq decoding and waveform synthesis. The seq2seq target can be highly compressed as long as it provides sufficient intelligibility and prosody information for an inversion process, which could be fixed or trained. We use 80-band mel-scale spectrogram as the target, though fewer bands or more concise targets such as cepstrum could be used.
which doesn't really explain the choice of this specific spectrogram over anything else.
As far as we could establish, this paper kicked off the use of Mel spectrograms in this domain, and we could not find any studies of alternative audio representations.