To the best of my knowledge, the closest thing to what you might be looking for is this recent article by Google researchers: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
Batch Normalization
Consider a layer $l$ with activation output $y_l = f(Wx+b)$, where $f$ is the nonlinearity (ReLU, tanh, etc.), $W$ and $b$ are the layer's weights and biases, and $x$ is the minibatch of inputs to the layer.
What Batch Normalization (BN) does is the following:
- Standardize $Wx+b$ to have mean zero and variance one, where the mean and variance are computed across the minibatch (separately for each unit). Let $\hat{x}$ denote the standardized pre-activation values, i.e. $\hat{x}$ is the normalized version of $Wx+b$.
- Apply a parameterized (learnable) affine transformation $\hat{x} \rightarrow \gamma \hat{x} + \beta.$
- Apply the nonlinearity: $\hat{y}_l = f(\gamma \hat{x} + \beta)$.
So BN standardizes the "raw" (read: before we apply the nonlinearity) activation outputs to have mean zero and variance one, then applies a learned affine transformation, and only then applies the nonlinearity. In some sense, we may interpret this as allowing the neural network to learn an appropriate parameterized input distribution for each nonlinearity.
As every operation involved is differentiable, the $\gamma, \beta$ parameters can be learned via backpropagation along with the rest of the network's weights.
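Concretely, here is a minimal NumPy sketch of the training-time forward pass for one BN layer (the function and variable names are my own, not from the paper; `eps` is the small constant the paper adds for numerical stability):

```python
import numpy as np

# Minimal sketch of a BN layer's training-time forward pass.
# x: minibatch of inputs, shape (batch_size, in_dim)
# W: weights, shape (in_dim, out_dim); b: biases, shape (out_dim,)
# gamma, beta: learnable BN parameters, one per output unit
def bn_layer_forward(x, W, b, gamma, beta, f=np.tanh, eps=1e-5):
    raw = x @ W + b                          # "raw" pre-activations, Wx + b
    mu = raw.mean(axis=0)                    # per-unit mean over the minibatch
    var = raw.var(axis=0)                    # per-unit variance over the minibatch
    x_hat = (raw - mu) / np.sqrt(var + eps)  # standardize to mean 0, variance 1
    return f(gamma * x_hat + beta)           # learned affine transform, then nonlinearity
```

(At test time the paper replaces the minibatch statistics with population estimates, e.g. running averages collected during training; the sketch above only covers the training-time computation.)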
Affine Transformation Motivation
If we did not perform a parameterized affine transformation, every nonlinearity would receive a mean zero, variance one input distribution. This may or may not be what we want. Note that if the original (un-normalized) distribution of $Wx+b$ happens to be optimal, the affine transformation can in principle recover it by setting $\gamma$ equal to the batch standard deviation and $\beta$ equal to the batch mean; conversely, if the mean zero, variance one input distribution is optimal, the network can simply learn $\gamma = 1$, $\beta = 0$. Having this parameterized affine transformation also has the added bonus of increasing the representational capacity of the network (more learnable parameters).
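As a quick numerical sanity check of the recovery argument (the values and names here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(loc=3.0, scale=2.5, size=(128, 10))  # stand-in for a batch of Wx + b

mu, sigma = raw.mean(axis=0), raw.std(axis=0)
x_hat = (raw - mu) / sigma                 # standardized pre-activations

gamma, beta = sigma, mu                    # gamma = batch std, beta = batch mean
recovered = gamma * x_hat + beta           # the affine transform undoes the standardization
print(np.allclose(recovered, raw))         # True: the original pre-activations are recovered
```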
Standardizing First
Why standardize first? Why not just apply the affine transformation? Theoretically speaking, there is no distinction: the composition of standardization and a learned affine map is still just an affine map of $Wx+b$. However, there may be a conditioning issue here. By first standardizing the activation values, it perhaps becomes easier to learn the optimal $\gamma, \beta$ parameters. This is purely conjecture on my part, but there are similar analogues in other recent state-of-the-art conv net architectures. For example, in the recent Microsoft Research technical report Deep Residual Learning for Image Recognition, they in effect learn a transformation relative to the identity, using the identity as a reference or baseline for comparison. The Microsoft co-authors believed that having this reference or baseline helped pre-condition the problem. I do not believe it is too far-fetched to wonder whether something similar is occurring here with BN and the initial standardization step.
BN Applications
A particularly interesting result is that, using Batch Normalization, the Google team was able to train a tanh Inception network on ImageNet and obtain pretty competitive results. Tanh is a saturating nonlinearity, and it has historically been difficult to get such networks to learn because of the saturation/vanishing-gradient problem. With Batch Normalization, however, one may conjecture that the network was able to learn a transformation that maps the pre-activation values into the non-saturating regime of the tanh nonlinearity.
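To make the saturation point concrete, here is a small sketch (my own illustration, not from the paper) of how the gradient through tanh collapses once its input drifts away from zero, which is why keeping the pre-activations standardized helps:

```python
import numpy as np

# tanh'(x) = 1 - tanh(x)^2, so gradients through tanh vanish once |x| grows.
# Keeping pre-activations near zero keeps tanh in its responsive, non-saturating regime.
for x in [0.0, 1.0, 3.0, 6.0]:
    grad = 1.0 - np.tanh(x) ** 2
    print(f"x = {x:>3}: tanh'(x) = {grad:.6f}")
# x = 0 gives gradient 1.0; by x = 6 the gradient is about 2e-5, i.e. effectively saturated.
```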
Final Notes
They even reference the same Yann LeCun factoid you mentioned as motivation for Batch Normalization.