I'd like to emphasize a paragraph from the link (ftp://ftp.sas.com/pub/neural/FAQ2.html#A_std_in) given by Shea Parkes:
But standardizing input variables can have far more important effects on initialization of the weights than simply avoiding saturation. Assume we have an MLP with one hidden layer applied to a classification problem and are therefore interested in the hyperplanes defined by each hidden unit. Each hyperplane is the locus of points where the net-input to the hidden unit is zero and is thus the classification boundary generated by that hidden unit considered in isolation. The connection weights from the inputs to a hidden unit determine the orientation of the hyperplane. The bias determines the distance of the hyperplane from the origin. If the bias terms are all small random numbers, then all the hyperplanes will pass close to the origin. Hence, if the data are not centered at the origin, the hyperplane may fail to pass through the data cloud. If all the inputs have a small coefficient of variation, it is quite possible that all the initial hyperplanes will miss the data entirely. With such a poor initialization, local minima are very likely to occur. It is therefore important to center the inputs to get good random initializations. In particular, scaling the inputs to [-1,1] will work better than [0,1], although any scaling that sets to zero the mean or median or other measure of central tendency is likely to be as good, and robust estimators of location and scale (Iglewicz, 1983) will be even better for input variables with extreme outliers.
So the way I understand it: if the data cloud is far away from the separating hyperplane, then things might not work well. As an exaggerated example, say all points in the data have large negative values and small variance. Then everything would sit in the region where the ReLU outputs zero, which means the gradients are all zero. If this happens for all units, there should be no training at all (I haven't run the experiment).
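A minimal sketch of that exaggerated case, assuming PyTorch. To force the "all units dead" condition I make the first-layer weights non-negative (an artificial choice, just to realize the scenario above); the inputs have a large negative mean and tiny spread, and the biases are small random numbers as in the quote. The hidden activations and the gradients reaching the first layer then come out as zero:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Data cloud far from the origin: each coordinate ~ -100 with tiny spread.
X = -100.0 + 0.1 * torch.randn(256, 10)
y = torch.randint(0, 2, (256,)).float()

fc1 = nn.Linear(10, 32)
fc2 = nn.Linear(32, 1)
with torch.no_grad():
    fc1.weight.abs_()             # non-negative weights => every net input is very negative
    fc1.bias.uniform_(-0.1, 0.1)  # small random biases, as described in the quote

h = torch.relu(fc1(X))            # hidden activations
loss = nn.functional.binary_cross_entropy_with_logits(fc2(h).squeeze(1), y)
loss.backward()

print((h > 0).float().mean())      # 0.0 -> every hidden unit is "dead" on the whole cloud
print(fc1.weight.grad.abs().max()) # 0.0 -> no gradient reaches the first layer, so no training
```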
Similarly, if all the net inputs are on the positive side, then the hidden layer is effectively just a linear layer (the ReLU acts as the identity on the whole data cloud)... but having the hyperplanes dissect the data cloud makes sure your NN/MLP starts training in a non-linear regime.
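To illustrate that second point, here is a short sketch (again assuming PyTorch; the helper name frac_units_cut_by_cloud is mine) that counts how many hidden units' hyperplanes actually pass through the data cloud, i.e. whose net input changes sign across the data. With the uncentered cloud essentially no hyperplane cuts it, so every unit is either all-off or all-linear; after centering/standardizing, most hyperplanes dissect the cloud:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def frac_units_cut_by_cloud(X):
    """Fraction of hidden units whose hyperplane w.x + b = 0 passes through
    the data cloud, i.e. whose net input takes both signs over X."""
    fc = nn.Linear(X.shape[1], 256)   # default init: small random weights and biases
    z = fc(X)                         # net inputs, shape (n_points, 256)
    return ((z.min(0).values < 0) & (z.max(0).values > 0)).float().mean().item()

X_raw = -100.0 + 0.1 * torch.randn(1000, 10)      # uncentered data cloud
X_std = (X_raw - X_raw.mean(0)) / X_raw.std(0)    # centered and standardized

print(frac_units_cut_by_cloud(X_raw))  # ~0.0: each unit is all-dead or all-linear on the data
print(frac_units_cut_by_cloud(X_std))  # close to 1.0: most hyperplanes dissect the cloud
```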