I'm planning to build an audio-generation neural network. While I'm reasonably comfortable with neural networks in general, WaveNets, etc., one thing is not quite clear to me.
What are good loss functions for audio, considering the points below?
- Target data may have variable leading silence.
- The length of the leading silence changes the "phase" of the whole wave. (Even a tiny shift can ruin a comparison.)
- The generator is a TTS model, so nothing in the input data indicates the phase/leading silence.
- If I just compare with any standard sample-wise loss, the phase mismatch can make an essentially correct output look 100% wrong (see the small sketch after this list).
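To make the phase problem concrete, here is a tiny illustrative NumPy sketch (the sample rate, frequency, and shift are numbers I picked arbitrarily): the "prediction" is a perfect copy of the target, just delayed by a few samples of leading silence, yet a plain sample-wise MSE still judges it as badly wrong.

```python
import numpy as np

# A "prediction" that is a perfect copy of the target,
# only delayed by a handful of samples of leading silence.
sr = 16000
t = np.arange(sr) / sr
target = np.sin(2 * np.pi * 440.0 * t)           # 1 s of a 440 Hz tone
shift = 20                                        # ~1.25 ms of extra leading silence
pred = np.concatenate([np.zeros(shift), target[:-shift]])

print(np.mean((target - target) ** 2))  # 0.0 for a truly identical waveform
print(np.mean((target - pred) ** 2))    # large, despite being the "same" audio
```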
Due to the above, I fear the model will have a really hard time deciding the length of the leading silence, especially for very similar text inputs. I imagine the generated waves will tend to get flattened out, because the target phases look essentially random to the model.
Are there solutions that apply a standard loss but shift the audio first somehow, so that the shift becomes irrelevant?
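Roughly the kind of thing I'm imagining (a minimal NumPy sketch; `shift_samples`, `shift_invariant_mse`, and `max_lag` are placeholder names I made up, not from any library):

```python
import numpy as np

def shift_samples(x, lag):
    """Delay (lag > 0) or advance (lag < 0) x by |lag| samples, zero-padding the gap."""
    if lag >= 0:
        return np.concatenate([np.zeros(lag), x[:len(x) - lag]])
    return np.concatenate([x[-lag:], np.zeros(-lag)])

def shift_invariant_mse(pred, target, max_lag=400):
    """Standard MSE, but taken over the best alignment within +/- max_lag samples."""
    return min(
        np.mean((shift_samples(pred, lag) - target) ** 2)
        for lag in range(-max_lag, max_lag + 1)
    )
```

In a real training loop I'd expect the best lag to be found in a non-differentiable step (e.g. picking the peak of a cross-correlation), with gradients flowing only through the standard loss on the aligned pair. Is something like this actually a sensible/known approach, or are there better-established shift-invariant losses for audio?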