I'm working on an audio dereverberation deep-learning model based on the U-Net architecture. The idea for my project came from image denoising with autoencoders. I feed the reverberant spectrogram to the network, and the network should output the cleaned version. I train the network with pairs of spectrograms: the clean version and the reverberant version.
This is the link to one of the papers I'm following for this project: https://arxiv.org/pdf/1803.08243.pdf
My problem is how to save the spectrograms of the audio data for training. I have tried two approaches:
- I saved the spectrograms as RGB images, so each one is a 3D tensor, exactly what a convolutional network wants as input for training. The trained model is then able to output a reconstructed version of the input spectrogram with less reverb. The problem with this solution is that I can't recover the audio from the cleaned spectrogram, since it is an RGB image.
- I saved the spectrogram matrix directly with numpy.save() and reloaded it with numpy.load(). With this solution I obtain the dereverberated spectrogram matrix directly as output, which can be fed to the Griffin-Lim algorithm to recover the audio (this works because I only consider the magnitude of the spectrogram). The problem with this solution is that I don't know whether I can feed this 2D numpy array (the STFT magnitude matrix) directly to the convolutional network, or whether I need to do some kind of preprocessing. A minimal sketch of this second workflow is below.
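
To make the second approach concrete, here is a minimal sketch of what I do, assuming librosa for the STFT and Griffin-Lim; the filenames, sample rate, and FFT parameters are placeholders, not the exact values from the paper:

```python
import numpy as np
import librosa
import soundfile as sf

# Placeholder parameters (my actual setup may differ).
SR = 16000
N_FFT = 512
HOP = 128

# --- Saving a training example (magnitude spectrogram only) ---
y, _ = librosa.load("reverberant.wav", sr=SR)
S = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))  # 2D array: (freq_bins, frames)
np.save("reverberant_spec.npy", S)

# --- Reloading and recovering audio from a magnitude spectrogram ---
S_hat = np.load("reverberant_spec.npy")  # in practice this would be the network's dereverberated output
y_hat = librosa.griffinlim(S_hat, n_iter=32, hop_length=HOP)
sf.write("reconstructed.wav", y_hat, SR)
```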