I was running some experiments analyzing a few audio samples and I'm stuck on the following. Suppose we have a clean audio signal ($y_{clean}$), a noisy version of it ($y_{noisy}$), and an enhanced version ($y_{enhanced}$) obtained by applying some speech enhancement algorithm to $y_{noisy}$.
What would be the right way to quantify the relative increase or decrease in the quality of $y_{enhanced}$ (compared to $y_{clean}$) in, say, particular sections of the audio? I'm not interested in SNR (or is it wrong to ignore it?). I was thinking of something like computing the spectrogram of both signals and then computing the distance between them, say using the $L_{\infty}$ or $L_2$ norm. I don't know if this is even a correct way to do it. Any help would be much appreciated!
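To make the idea concrete, here is a minimal sketch of what I have in mind, using only NumPy. Everything in it is an assumption on my part: the function names, the frame length and hop size, and the synthetic signals (a sine tone plus scaled noise) that stand in for real recordings and for the output of an actual enhancement algorithm. It measures a per-frame spectral distance, which I realize may not correspond to perceived quality.

```python
import numpy as np

def magnitude_spectrogram(y, frame_len=512, hop=256):
    """Magnitude STFT with a Hann window; shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack(
        [y[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def framewise_distances(y_ref, y_test, frame_len=512, hop=256):
    """Per-frame L2 and L-infinity distances between two magnitude spectrograms."""
    diff = (magnitude_spectrogram(y_ref, frame_len, hop)
            - magnitude_spectrogram(y_test, frame_len, hop))
    l2 = np.linalg.norm(diff, ord=2, axis=1)   # Euclidean norm over frequency bins
    linf = np.max(np.abs(diff), axis=1)        # largest deviation in any single bin
    return l2, linf

# Synthetic stand-ins (assumptions, not real data): 1 s at 16 kHz.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
noise = rng.standard_normal(sr)
y_clean = np.sin(2 * np.pi * 440 * t)
y_noisy = y_clean + 0.5 * noise
y_enhanced = y_clean + 0.1 * noise  # pretend the enhancer removed most of the noise

l2_noisy, linf_noisy = framewise_distances(y_clean, y_noisy)
l2_enh, linf_enh = framewise_distances(y_clean, y_enhanced)

# A positive value means the enhanced frame is closer to the clean one
# than the noisy frame was; each entry corresponds to one section of audio.
improvement = l2_noisy - l2_enh
print("mean L2 distance, noisy vs clean:   ", l2_noisy.mean())
print("mean L2 distance, enhanced vs clean:", l2_enh.mean())
```

Each entry of `improvement` corresponds to one frame, so averaging it over a range of frames would give a number for a particular section of the audio.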
(PS: I'm from a CS background, so I don't have much knowledge of signal processing. Apologies if this question seems too basic.)
Thanks.