I'm planning to build an audio-generation neural network. While I'm reasonably comfortable with neural networks in general, WaveNets, etc., one thing is not quite clear to me.
What are good loss functions for audio, considering the points below?
- Target data may have variable leading silence.
- The length of the leading silence changes the "phase" of the whole wave. (Even a tiny shift can ruin a comparison.)
- The generator is a TTS model, so nothing in the input data indicates the phase/leading silence.
- If I just compare with any standard sample-wise loss, the phase mismatch can make an essentially correct output look 100% wrong (see the small sketch after this list).
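To make the phase problem concrete, here is a tiny illustrative NumPy sketch (the sample rate, frequency, and shift are numbers I picked arbitrarily): the "prediction" is a perfect copy of the target, just delayed by a few samples of leading silence, yet a plain sample-wise MSE still judges it as badly wrong.

```python
import numpy as np

# A "prediction" that is a perfect copy of the target,
# only delayed by a handful of samples of leading silence.
sr = 16000
t = np.arange(sr) / sr
target = np.sin(2 * np.pi * 440.0 * t)           # 1 s of a 440 Hz tone
shift = 20                                        # ~1.25 ms of extra leading silence
pred = np.concatenate([np.zeros(shift), target[:-shift]])

print(np.mean((target - target) ** 2))  # 0.0 for a truly identical waveform
print(np.mean((target - pred) ** 2))    # large, despite being the "same" audio
```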
Due to the above, I fear the model will have a really hard time deciding the length of the leading silence, especially for very similar text inputs. I imagine the generated waves will tend to get flattened out, because the target phases look essentially random to the model.
Are there solutions that apply a standard loss but shift the audio first somehow, so that the shift becomes irrelevant?
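Roughly the kind of thing I'm imagining (a minimal NumPy sketch; `shift_samples`, `shift_invariant_mse`, and `max_lag` are placeholder names I made up, not from any library):

```python
import numpy as np

def shift_samples(x, lag):
    """Delay (lag > 0) or advance (lag < 0) x by |lag| samples, zero-padding the gap."""
    if lag >= 0:
        return np.concatenate([np.zeros(lag), x[:len(x) - lag]])
    return np.concatenate([x[-lag:], np.zeros(-lag)])

def shift_invariant_mse(pred, target, max_lag=400):
    """Standard MSE, but taken over the best alignment within +/- max_lag samples."""
    return min(
        np.mean((shift_samples(pred, lag) - target) ** 2)
        for lag in range(-max_lag, max_lag + 1)
    )
```

In a real training loop I'd expect the best lag to be found in a non-differentiable step (e.g. picking the peak of a cross-correlation), with gradients flowing only through the standard loss on the aligned pair. Is something like this actually a sensible/known approach, or are there better-established shift-invariant losses for audio?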