
Why is almost all speech processing, whether generative or recognition, so heavily based on Mel spectrograms?

In a conversation with a signal processing expert, I was asked why most ML systems in the speech processing domain work with Mel spectrograms rather than other spectrograms or audio representations, some of which may be invertible and would thus remove the need for components like neural vocoders.

I have tried using FFT-based spectrograms in the past without success; my assumption is that such high-dimensional data leaves too much room for noise. However, I was informed that lower-dimensional invertible spectrograms do exist, and the question was why nobody is using those.
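To make the invertibility point concrete, here is a minimal numpy/scipy sketch (the test tone, FFT size, and band count are illustrative, and the triangular filterbank uses the common `2595 * log10(1 + f/700)` mel formula): the complex STFT round-trips exactly, while the mel spectrogram discards phase and compresses 257 frequency bins into 80 bands, which is why inverting it needs an estimator such as a vocoder.

```python
import numpy as np
from scipy.signal import stft, istft

# Illustrative signal: 1 s of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)

# The complex STFT is invertible: istft recovers the waveform exactly.
f, _, Z = stft(x, fs=sr, nperseg=512)
_, x_rec = istft(Z, fs=sr, nperseg=512)
print(np.max(np.abs(x - x_rec[: len(x)])))  # ~0 (machine precision)

# A mel spectrogram discards phase AND compresses frequency,
# so it is only approximately invertible (hence neural vocoders).
def mel_filterbank(sr, n_fft, n_mels):
    # Hz <-> mel via the common 2595 * log10(1 + f/700) formula.
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(0, hz_to_mel(sr / 2), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope
    return fb

power = np.abs(Z) ** 2                      # phase is thrown away here
mel = mel_filterbank(sr, 512, 80) @ power   # 80 mel bands, as in Tacotron
print(mel.shape)                            # (80, n_frames): far fewer rows than power
```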

We looked briefly into speech processing with alternative audio representations and did not find anything. Additionally, the original Tacotron paper only provides the following paragraph explaining the choice of Mel spectrogram:

> Because of this redundancy, we use a different target for seq2seq decoding and waveform synthesis. The seq2seq target can be highly compressed as long as it provides sufficient intelligibility and prosody information for an inversion process, which could be fixed or trained. We use 80-band mel-scale spectrogram as the target, though fewer bands or more concise targets such as cepstrum could be used.

which doesn't really explain the choice of this specific spectrogram over alternatives.

As far as we could establish, this paper kicked off the use of Mel spectrograms in this domain, and we could not find any studies of alternative audio representations.

Rijul Gupta
    Good question. See [Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks](https://arxiv.org/abs/1706.07156) by M. Huzaifah for some motivation. – mhdadk May 12 '21 at 14:57
  • @mhdadk that's a really good reference to follow, however I have even more questions now. Do you know why they didn't try to tune their networks for any other spectrum? For a classification problem it makes sense to use the lowest-resolution representation that is still intelligible, but that doesn't really say much about applicability beyond that one task with that specific formulation. – Rijul Gupta May 12 '21 at 15:04
  • I think the key is the [Mel scale](https://en.wikipedia.org/wiki/Mel_scale). Because the Mel scale closely mimics human perception, it offers a good representation of the frequencies that humans typically hear. Also, a spectrogram is just the squared magnitude spectrum of an audio signal. What other spectra did you have in mind? – mhdadk May 12 '21 at 15:34
  • It doesn't have to be a spectrum, the key point of our discussion was using invertible representations to remove the need for vocoders. Honestly the Mel scale does make intuitive sense, but I don't think there is any mathematical rigor behind that intuition. – Rijul Gupta May 12 '21 at 15:45
  • There is no mathematical rigor behind the Mel scale. There is some physiology and psychoacoustics behind it, but even that is not really rigorous. See things like the Bark scale. But Mel works considerably better than standard FFT spacing, so that is pretty much why it is used. – Jon Nordby Jun 23 '21 at 14:24
  • Which kind of spectrogram do you refer to by "lower dimensionality invertible spectrograms" ? – Jon Nordby Jun 23 '21 at 14:25
  • The issue with inverting a spectrogram is due to the complex value (or rather, the phase) component - not really with the frequency bin spacing in a magnitude spectrogram (Mel vs Bark vs FFT etc). There are some works now that use the Discrete Cosine Transform, which is real-valued only and sidesteps the phase issue – Jon Nordby Jun 23 '21 at 14:27

0 Answers