I'd like to emphasize a paragraph from the link (ftp://ftp.sas.com/pub/neural/FAQ2.html#A_std_in) given by Shea Parkes:
But standardizing input variables can have far more important effects on initialization of the weights than simply avoiding saturation. Assume we have an MLP with one hidden layer applied to a classification problem and are therefore interested in the hyperplanes defined by each hidden unit. Each hyperplane is the locus of points where the net-input to the hidden unit is zero and is thus the classification boundary generated by that hidden unit considered in isolation. The connection weights from the inputs to a hidden unit determine the orientation of the hyperplane. The bias determines the distance of the hyperplane from the origin. If the bias terms are all small random numbers, then all the hyperplanes will pass close to the origin. Hence, if the data are not centered at the origin, the hyperplane may fail to pass through the data cloud. If all the inputs have a small coefficient of variation, it is quite possible that all the initial hyperplanes will miss the data entirely. With such a poor initialization, local minima are very likely to occur. It is therefore important to center the inputs to get good random initializations. In particular, scaling the inputs to [-1,1] will work better than [0,1], although any scaling that sets to zero the mean or median or other measure of central tendency is likely to be as good, and robust estimators of location and scale (Iglewicz, 1983) will be even better for input variables with extreme outliers.
So the way I understand it: if the data cloud is far away from the separating hyperplane, then things might not work well. As an exaggerated example, say all points in the data have large negative values and small variance. Then everything would sit in the region where the ReLU outputs zero, which means the gradients are all zero. If this happens for all units, there should be no training at all (I haven't run the experiment).
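A minimal sketch of that exaggerated case, assuming PyTorch. To force the "all units dead" condition I make the first-layer weights non-negative (an artificial choice, just to realize the scenario above); the inputs have a large negative mean and tiny spread, and the biases are small random numbers as in the quote. The hidden activations and the gradients reaching the first layer then come out as zero:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Data cloud far from the origin: each coordinate ~ -100 with tiny spread.
X = -100.0 + 0.1 * torch.randn(256, 10)
y = torch.randint(0, 2, (256,)).float()

fc1 = nn.Linear(10, 32)
fc2 = nn.Linear(32, 1)
with torch.no_grad():
    fc1.weight.abs_()             # non-negative weights => every net input is very negative
    fc1.bias.uniform_(-0.1, 0.1)  # small random biases, as described in the quote

h = torch.relu(fc1(X))            # hidden activations
loss = nn.functional.binary_cross_entropy_with_logits(fc2(h).squeeze(1), y)
loss.backward()

print((h > 0).float().mean())      # 0.0 -> every hidden unit is "dead" on the whole cloud
print(fc1.weight.grad.abs().max()) # 0.0 -> no gradient reaches the first layer, so no training
```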
Similarly, if all the net inputs are on the positive side, then the hidden layer is effectively just a linear layer (the ReLU acts as the identity on the whole data cloud)... but having the hyperplanes dissect the data cloud makes sure your NN/MLP starts training in a non-linear regime.
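To illustrate that second point, here is a short sketch (again assuming PyTorch; the helper name frac_units_cut_by_cloud is mine) that counts how many hidden units' hyperplanes actually pass through the data cloud, i.e. whose net input changes sign across the data. With the uncentered cloud essentially no hyperplane cuts it, so every unit is either all-off or all-linear; after centering/standardizing, most hyperplanes dissect the cloud:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def frac_units_cut_by_cloud(X):
    """Fraction of hidden units whose hyperplane w.x + b = 0 passes through
    the data cloud, i.e. whose net input takes both signs over X."""
    fc = nn.Linear(X.shape[1], 256)   # default init: small random weights and biases
    z = fc(X)                         # net inputs, shape (n_points, 256)
    return ((z.min(0).values < 0) & (z.max(0).values > 0)).float().mean().item()

X_raw = -100.0 + 0.1 * torch.randn(1000, 10)      # uncentered data cloud
X_std = (X_raw - X_raw.mean(0)) / X_raw.std(0)    # centered and standardized

print(frac_units_cut_by_cloud(X_raw))  # ~0.0: each unit is all-dead or all-linear on the data
print(frac_units_cut_by_cloud(X_std))  # close to 1.0: most hyperplanes dissect the cloud
```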