To the best of my knowledge, the closest thing to what you might be looking for is this recent article by Google researchers: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
Batch Normalization
Consider a layer $l$ with activation output $y_l = f(Wx+b)$, where $f$ is the nonlinearity (ReLU, tanh, etc.), $W$ and $b$ are the layer's weights and biases, and $x$ is the minibatch of inputs to the layer.
What Batch Normalization (BN) does is the following:
- Standardize $Wx+b$ to have mean zero and variance one, where the mean and variance are computed across the minibatch (separately for each unit). Let $\hat{x}$ denote the standardized pre-activation values, i.e. $\hat{x}$ is the normalized version of $Wx+b$.
- Apply a parameterized (learnable) affine transformation $\hat{x} \rightarrow \gamma \hat{x} + \beta.$
- Apply the nonlinearity: $\hat{y}_l = f(\gamma \hat{x} + \beta)$.
So BN standardizes the "raw" (read: before we apply the nonlinearity) activation outputs to have mean zero and variance one, then applies a learned affine transformation, and only then applies the nonlinearity. In some sense, we may interpret this as allowing the neural network to learn an appropriate parameterized input distribution for each nonlinearity.
As every operation involved is differentiable, the $\gamma, \beta$ parameters can be learned via backpropagation along with the rest of the network's weights.
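Concretely, here is a minimal NumPy sketch of the training-time forward pass for one BN layer (the function and variable names are my own, not from the paper; `eps` is the small constant the paper adds for numerical stability):

```python
import numpy as np

# Minimal sketch of a BN layer's training-time forward pass.
# x: minibatch of inputs, shape (batch_size, in_dim)
# W: weights, shape (in_dim, out_dim); b: biases, shape (out_dim,)
# gamma, beta: learnable BN parameters, one per output unit
def bn_layer_forward(x, W, b, gamma, beta, f=np.tanh, eps=1e-5):
    raw = x @ W + b                          # "raw" pre-activations, Wx + b
    mu = raw.mean(axis=0)                    # per-unit mean over the minibatch
    var = raw.var(axis=0)                    # per-unit variance over the minibatch
    x_hat = (raw - mu) / np.sqrt(var + eps)  # standardize to mean 0, variance 1
    return f(gamma * x_hat + beta)           # learned affine transform, then nonlinearity
```

(At test time the paper replaces the minibatch statistics with population estimates, e.g. running averages collected during training; the sketch above only covers the training-time computation.)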
Affine Transformation Motivation
If we did not perform a parameterized affine transformation, every nonlinearity would receive a mean zero, variance one input distribution. This may or may not be what we want. Note that if the original (un-normalized) distribution of $Wx+b$ happens to be optimal, the affine transformation can in principle recover it by setting $\gamma$ equal to the batch standard deviation and $\beta$ equal to the batch mean; conversely, if the mean zero, variance one input distribution is optimal, the network can simply learn $\gamma = 1$, $\beta = 0$. Having this parameterized affine transformation also has the added bonus of increasing the representational capacity of the network (more learnable parameters).
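As a quick numerical sanity check of the recovery argument (the values and names here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(loc=3.0, scale=2.5, size=(128, 10))  # stand-in for a batch of Wx + b

mu, sigma = raw.mean(axis=0), raw.std(axis=0)
x_hat = (raw - mu) / sigma                 # standardized pre-activations

gamma, beta = sigma, mu                    # gamma = batch std, beta = batch mean
recovered = gamma * x_hat + beta           # the affine transform undoes the standardization
print(np.allclose(recovered, raw))         # True: the original pre-activations are recovered
```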
Standardizing First
Why standardize first? Why not just apply the affine transformation? Theoretically speaking, there is no distinction: the composition of standardization and a learned affine map is still just an affine map of $Wx+b$. However, there may be a conditioning issue here. By first standardizing the activation values, it perhaps becomes easier to learn the optimal $\gamma, \beta$ parameters. This is purely conjecture on my part, but there are similar analogues in other recent state-of-the-art conv net architectures. For example, in the recent Microsoft Research technical report Deep Residual Learning for Image Recognition, they in effect learn a transformation relative to the identity, using the identity as a reference or baseline for comparison. The Microsoft co-authors believed that having this reference or baseline helped pre-condition the problem. I do not believe it is too far-fetched to wonder whether something similar is occurring here with BN and the initial standardization step.
BN Applications
A particularly interesting result is that, using Batch Normalization, the Google team was able to train a tanh Inception network on ImageNet and obtain pretty competitive results. Tanh is a saturating nonlinearity, and it has historically been difficult to get such networks to learn because of the saturation/vanishing-gradient problem. With Batch Normalization, however, one may conjecture that the network was able to learn a transformation that maps the pre-activation values into the non-saturating regime of the tanh nonlinearity.
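To make the saturation point concrete, here is a small sketch (my own illustration, not from the paper) of how the gradient through tanh collapses once its input drifts away from zero, which is why keeping the pre-activations standardized helps:

```python
import numpy as np

# tanh'(x) = 1 - tanh(x)^2, so gradients through tanh vanish once |x| grows.
# Keeping pre-activations near zero keeps tanh in its responsive, non-saturating regime.
for x in [0.0, 1.0, 3.0, 6.0]:
    grad = 1.0 - np.tanh(x) ** 2
    print(f"x = {x:>3}: tanh'(x) = {grad:.6f}")
# x = 0 gives gradient 1.0; by x = 6 the gradient is about 2e-5, i.e. effectively saturated.
```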
Final Notes
They even reference the same Yann LeCun factoid you mentioned as motivation for Batch Normalization.