Normalization for MFCC?

Question

I'm planning to use MFCCs extracted from audio signals to build a speaker recognizer. I noticed that the first MFCC term tends to be very large compared to the others. That's why I think normalization is needed when working with machine learning algorithms (LSTM and HMM in my case), so that my MFCC values lie in (-0.5, 0.5) or (-1, 1).

I tried `(mfccs - mean) / std` and I'm currently trying min-max normalization.

I know how each of these methods is calculated, but what are the practical differences between them (or any other method) when the features are fed to a machine learning algorithm?

Hi there and welcome. My two cents: input scaling is there only/mostly to aid numerical optimization. I personally prefer min-max normalization as it puts all inputs on the same scale. – *Reviewer* — Jim, May 29 '18 at 10:59
Thanks! That's what I am using for now, I'll see in a few days how it behaves. Working with `(mfccs - mean) / std` didn't give very good results – Isaac, May 29 '18 at 18:57

score 1 · Accepted Answer · edited Jan 26 '20 at 20:16

The common practice is to use cepstral mean and variance normalization (CMVN), which is the `(mfccs - mean) / std` formula you already mention, applied per coefficient. Alternatively, you can try CMVN over a sliding window, or feature warping. Note that MFCCs are now commonly fed to neural networks, where the normalization can in principle be done by a batch normalization layer inside the network. However, experimental results show that pre-normalizing the features with sliding-window CMVN still helps.
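To make the two variants concrete, here is a minimal NumPy sketch of per-utterance CMVN and sliding-window CMVN. It assumes the MFCCs are stored as an array of shape `(n_frames, n_coeffs)`; the function names `cmvn` and `sliding_cmvn`, the 301-frame window, and the synthetic input are illustrative choices of mine, not something taken from a particular toolkit.

```python
import numpy as np

def cmvn(mfccs, eps=1e-8):
    """Per-utterance CMVN: zero mean and unit variance for each coefficient.

    mfccs: array of shape (n_frames, n_coeffs) (assumed layout).
    """
    mean = mfccs.mean(axis=0, keepdims=True)
    std = mfccs.std(axis=0, keepdims=True)
    return (mfccs - mean) / (std + eps)

def sliding_cmvn(mfccs, window=301, eps=1e-8):
    """CMVN with statistics taken from a window centred on each frame.

    The window is truncated at the start and end of the utterance.
    The default of 301 frames is an arbitrary illustrative value.
    """
    n_frames = mfccs.shape[0]
    half = window // 2
    out = np.empty_like(mfccs, dtype=float)
    for t in range(n_frames):
        lo = max(0, t - half)
        hi = min(n_frames, t + half + 1)
        chunk = mfccs[lo:hi]
        out[t] = (mfccs[t] - chunk.mean(axis=0)) / (chunk.std(axis=0) + eps)
    return out

# Synthetic stand-in for real MFCCs: 500 frames, 13 coefficients,
# with the first coefficient made much larger than the rest.
rng = np.random.default_rng(0)
mfccs = rng.normal(size=(500, 13))
mfccs[:, 0] = mfccs[:, 0] * 50.0 - 300.0

normalized = cmvn(mfccs)                     # global statistics per utterance
windowed = sliding_cmvn(mfccs, window=301)   # local statistics per frame
```

After either call, the first coefficient is on the same scale as the others. Unlike min-max scaling, the output is not confined to a fixed interval such as (-1, 1); it is centred at zero with unit variance, and a single outlier frame does not determine the scale of the whole utterance.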

