
I'm working on an audio dereverberation deep-learning model based on the U-Net architecture. The idea for my project came from image denoising with autoencoders. I feed the reverberated spectrogram to the network, and the network should output the cleaned version. I train the network with pairs of spectrograms: the clean version and the reverberated version.

This is the link to one of the papers I'm following for this project: https://arxiv.org/pdf/1803.08243.pdf

My problem is how to save spectrograms of audio data for training. I have run two tests:

  1. I have saved spectrograms as RGB images, so they are 3D tensors, exactly what a convolutional network expects as input for training. The trained model is then able to output a reconstructed version of the input spectrogram with less reverb. The problem with this solution is that I can't recover the audio from the cleaned spectrogram, since it is an RGB image.
  2. I have saved the spectrogram matrix directly with numpy.save(), and then reloaded it with numpy.load(). With this solution I obtain as output the dereverberated spectrogram matrix directly, which can be fed to the Griffin-Lim algorithm to recover the audio (this works because I consider just the magnitude of the spectrogram). The problem with this solution is that I don't know whether I can feed this 2D numpy array (the STFT magnitude matrix) directly to the convolutional network, or whether I need to do some kind of preprocessing.
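For what it's worth, a minimal numpy-only sketch of the preprocessing in test 2 — loading a saved magnitude matrix, compressing its dynamic range, and adding a channel axis so a 2D matrix becomes a valid CNN input. The log-compression and min-max normalization here are my own illustrative choices, not anything prescribed by the paper:

```python
import numpy as np

# Stand-in for a magnitude spectrogram reloaded with np.load(),
# e.g. one saved earlier via np.save("spec.npy", np.abs(stft_matrix)).
# Shape: (freq_bins, time_frames).
mag = np.abs(np.random.randn(257, 128)).astype(np.float32)

# Log-compress to tame the large dynamic range of magnitude spectra,
# then rescale to [0, 1] (illustrative normalization).
log_mag = np.log1p(mag)
log_mag = (log_mag - log_mag.min()) / (log_mag.max() - log_mag.min() + 1e-8)

# A 2D matrix only needs an explicit channel axis to be fed to a
# convolutional network: (freq, time) -> (freq, time, 1) channels-last.
x = log_mag[..., np.newaxis]
print(x.shape)  # (257, 128, 1)
```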
  • Why would you save as RGB? It triples data size without adding any information. CNN's aren't restricted to any input channel size. – OverLordGoldDragon Jan 01 '22 at 23:40
  • Thank you for the comment. I completely agree with you. But when I train the U-Net, results are better with the spectrograms saved as images, and I don't know why. – Lorenzoncina Jan 02 '22 at 14:11
  • Strange. Not the first time I've seen this, and it makes no sense to me - I've [asked about it](https://stats.stackexchange.com/q/559009/239063). – OverLordGoldDragon Jan 02 '22 at 21:00

1 Answer


Spectrograms will work with any network that can operate on images. A spectrogram, however, is not an image, and many image techniques will be inapplicable:

  1. Data augmentation via rotation: a rotated spectrogram doesn't represent the same process at all, or even any process (there may not be a signal that maps to a given 2D array).
  2. Some networks are tailored specifically to exploit image-specific priors (such as rotation-invariance, per point 1) which are useless or detrimental for time-frequency representations.

This is more of an ML question suited for another SE site, where one could comment specifically on U-Net. As for RGB, it triples data size without adding any information, and should degrade performance as it breaks spatial dependencies - regardless, I've seen it used before, and opened a question on it.
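To illustrate the size point with a quick numpy sketch (shapes are arbitrary): an RGB encoding just replicates, or colormaps, the same values into three channels, so the tensor is three times larger with no new information:

```python
import numpy as np

# Arbitrary magnitude spectrogram, (freq_bins, time_frames).
spec = np.random.rand(257, 128).astype(np.float32)

# Single-channel tensor: (freq, time, 1).
mono = spec[..., np.newaxis]

# An RGB version carries the same values in 3 channels: (freq, time, 3).
rgb = np.repeat(mono, 3, axis=-1)

print(rgb.nbytes // mono.nbytes)  # 3
```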

OverLordGoldDragon