
I'm working on an audio dereverberation deep-learning model based on the U-Net architecture. The idea for my project came from image denoising with autoencoders. I feed the reverberated spectrogram to the network, and the network should output the cleaned version. I train the network with pairs of spectrograms: the clean version and the reverberated version.

This is the link to one of the papers I'm following for this project: https://arxiv.org/pdf/1803.08243.pdf

My problem is how to save spectrograms of audio data for training. I have run two tests:

  1. I have saved spectrograms as RGB images, so they are 3D tensors, exactly what a convolutional network expects as input for training. The trained model is then able to output a reconstructed version of the input spectrogram with less reverb. The problem with this solution is that I can't recover the audio from the cleaned spectrogram, since it is an RGB image.
  2. I have saved the spectrogram matrix directly with numpy.save(), and then reloaded it with numpy.load(). With this solution I obtain as output the dereverberated spectrogram matrix directly, which can be fed to the Griffin-Lim algorithm to recover the audio (this works because I consider just the magnitude of the spectrogram). The problem with this solution is that I don't know whether I can feed this 2D numpy array (the STFT magnitude matrix) directly to the convolutional network, or whether I need to do some kind of preprocessing.
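For what it's worth, a minimal numpy-only sketch of the preprocessing in test 2 — loading a saved magnitude matrix, compressing its dynamic range, and adding a channel axis so a 2D matrix becomes a valid CNN input. The log-compression and min-max normalization here are my own illustrative choices, not anything prescribed by the paper:

```python
import numpy as np

# Stand-in for a magnitude spectrogram reloaded with np.load(),
# e.g. one saved earlier via np.save("spec.npy", np.abs(stft_matrix)).
# Shape: (freq_bins, time_frames).
mag = np.abs(np.random.randn(257, 128)).astype(np.float32)

# Log-compress to tame the large dynamic range of magnitude spectra,
# then rescale to [0, 1] (illustrative normalization).
log_mag = np.log1p(mag)
log_mag = (log_mag - log_mag.min()) / (log_mag.max() - log_mag.min() + 1e-8)

# A 2D matrix only needs an explicit channel axis to be fed to a
# convolutional network: (freq, time) -> (freq, time, 1) channels-last.
x = log_mag[..., np.newaxis]
print(x.shape)  # (257, 128, 1)
```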
  • Why would you save as RGB? It triples data size without adding any information. CNN's aren't restricted to any input channel size. – OverLordGoldDragon Jan 01 '22 at 23:40
  • Thank you for the comment. I completely agree with you. But when I train the U-Net, results are better with the spectrograms saved as images, and I don't know why. – Lorenzoncina Jan 02 '22 at 14:11
  • Strange. Not the first time I've seen this, and it makes no sense to me - I've [asked about it](https://stats.stackexchange.com/q/559009/239063). – OverLordGoldDragon Jan 02 '22 at 21:00

1 Answer


Spectrograms will work with any network that can operate on images. A spectrogram, however, is not an image, and many image techniques will be inapplicable:

  1. Data augmentation via rotation: a rotated spectrogram doesn't represent the same process at all, or even any process (there may not be a signal that maps to a given 2D array).
  2. Some networks are tailored specifically to exploit image-specific priors (such as rotation-invariance, per point 1) which are useless or detrimental for time-frequency representations.

This is more of an ML question suited for another SE site, where one could comment specifically on U-Net. As for RGB, it triples data size without adding any information, and should degrade performance as it breaks spatial dependencies - regardless, I've seen it used before, and opened a question on it.
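To illustrate the size point with a quick numpy sketch (shapes are arbitrary): an RGB encoding just replicates, or colormaps, the same values into three channels, so the tensor is three times larger with no new information:

```python
import numpy as np

# Arbitrary magnitude spectrogram, (freq_bins, time_frames).
spec = np.random.rand(257, 128).astype(np.float32)

# Single-channel tensor: (freq, time, 1).
mono = spec[..., np.newaxis]

# An RGB version carries the same values in 3 channels: (freq, time, 3).
rgb = np.repeat(mono, 3, axis=-1)

print(rgb.nbytes // mono.nbytes)  # 3
```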

OverLordGoldDragon