15

When initializing connection weights in a feedforward neural network, it is important to initialize them randomly to avoid any symmetries that the learning algorithm would not be able to break.

The recommendation I have seen in various places (e.g., in TensorFlow's MNIST tutorial) is to use a truncated normal distribution with a standard deviation of $\dfrac{1}{\sqrt{N}}$, where $N$ is the number of inputs to the given neuron layer.

I believe that the standard deviation formula ensures that backpropagated gradients neither vanish nor explode too quickly. But I don't understand why we use a truncated normal distribution rather than an ordinary normal distribution. Is it to avoid rare outlier weights?
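For concreteness, here is a minimal sketch of what I mean, using TensorFlow 2's API (the layer sizes are made up):

```python
import numpy as np
import tensorflow as tf

n_inputs, n_units = 784, 128                      # illustrative layer sizes
stddev = 1.0 / np.sqrt(n_inputs)

# Truncated normal: draws falling more than 2 standard deviations from the
# mean are discarded and re-drawn, so no initial weight exceeds 2 * stddev.
weights = tf.Variable(
    tf.random.truncated_normal([n_inputs, n_units], mean=0.0, stddev=stddev))
```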

MiniQuark
  • Can you provide source of this recommendation and/or the direct quotation? – Tim Aug 07 '16 at 17:54
  • @Tim Good point, I added a link to an example. I believe I also saw this recommendation in a paper about neural network good practices (can't find it, though). – MiniQuark Aug 07 '16 at 18:18

2 Answers

14

I think it's about saturation of the neurons. Suppose you have an activation function like the sigmoid.

[Figure: plot of the sigmoid activation function]

If a weight is initialized at a value of 2 or more (or −2 or less), the neuron's weighted input tends to land in the sigmoid's saturated region, where the gradient is nearly zero, so the neuron will not learn. Truncating the normal distribution rules out such extreme draws, at least at initialization, given your chosen variance. I think that is why a truncated normal is generally the better choice.
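To make this concrete, here is a rough sketch (using NumPy and SciPy; the stddev and sample count are arbitrary) comparing how often draws from an ordinary normal versus a normal truncated at two standard deviations land beyond $2\sigma$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
stddev, n = 0.1, 100_000

# Ordinary normal: roughly 4-5% of draws fall outside [-2*stddev, 2*stddev].
normal = rng.normal(0.0, stddev, size=n)
print("normal   :", np.mean(np.abs(normal) > 2 * stddev))

# Normal truncated at +/- 2 standard deviations: no such draws at all.
truncated = stats.truncnorm(-2, 2, loc=0.0, scale=stddev).rvs(size=n, random_state=42)
print("truncated:", np.mean(np.abs(truncated) > 2 * stddev))
```

The two-standard-deviation cutoff matches TensorFlow's truncated-normal sampler, which discards and re-draws any value more than two standard deviations from the mean.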

Güngör Basa
6

A truncated normal distribution keeps the sampled parameters close to 0, and keeping the parameters close to 0 is desirable. See this question: https://stackoverflow.com/q/34569903/3552975

[Figure: probability density of the truncated normal distribution]

Three reasons to keep the parameters small (source: Probabilistic Deep Learning: with Python, Keras and Tensorflow Probability):

  1. Experience shows that trained NNs often have small weights.
  2. Smaller weights lead to less extreme outputs (in classification, less extreme probabilities), which is desirable for an untrained model.
  3. It’s a known property of prediction models that adding a component to the loss function that prefers small weights often helps to get higher prediction performance. This approach is also known as regularization or weight decay in non-Bayesian NNs (see the sketch after this list).
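As a rough illustration of reason 3 (not from the book), a small-weights penalty can be added explicitly as an L2 regularizer in Keras; the layer width, initializer stddev, and penalty strength below are arbitrary:

```python
import tensorflow as tf

# An L2 penalty on the kernel adds a term to the loss that prefers small
# weights (weight decay), complementing an initialization that starts near 0.
layer = tf.keras.layers.Dense(
    128,
    activation="relu",
    kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.1),
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),
)
```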

And in the blog post A Gentle Introduction to Weight Constraints in Deep Learning, Dr. Jason Brownlee states that:

Smaller weights in a neural network can result in a model that is more stable and less likely to overfit the training dataset, in turn having better performance when making a prediction on new data.


If you employ ReLU, it is also good practice to use a slightly positive initial bias:

One should generally initialize weights with a small amount of noise for symmetry breaking, and to prevent 0 gradients. Since we're using ReLU neurons, it is also good practice to initialize them with a slightly positive initial bias to avoid "dead neurons".
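A minimal sketch of that practice, loosely following the tutorial's helper functions but written with current TensorFlow calls (the shapes are illustrative):

```python
import tensorflow as tf

def weight_variable(shape):
    # Truncated normal with a small stddev: breaks symmetry without
    # producing extreme initial weights.
    return tf.Variable(tf.random.truncated_normal(shape, stddev=0.1))

def bias_variable(shape):
    # Slightly positive bias so ReLU units start in their active region
    # and are less likely to be "dead" from the first update.
    return tf.Variable(tf.constant(0.1, shape=shape))

W = weight_variable([784, 128])   # illustrative layer shape
b = bias_variable([128])
```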

Lerner Zhang
  • I'm not sure how using the truncated_normal will prevent dead neurons: it won't add any "slightly positive initial bias". Can you please elaborate? – MiniQuark Mar 01 '17 at 15:06
  • because the backpropagation will only update 'live' neurons, with some nonzero contribution to the propagation – Jason Jul 29 '18 at 10:55
  • @MiniQuark You are right. I have updated my answer. – Lerner Zhang Sep 16 '21 at 14:03
  • Can you edit your first paragraph? It seems to be missing a few words. // The quotation from the SNN paper is specifically about the self-normalizing network. It's not a general claim that all 0-mean, unit-variance activations will be propagated through the network, which they demonstrate by comparing SNNs to standard FFNs. – Sycorax Sep 16 '21 at 14:09
  • @Sycorax Is it better now? Any further suggestions would be appreciated. – Lerner Zhang Sep 16 '21 at 14:15