Several tutorials I have read state that "Xavier" weight initialization (paper: Glorot & Bengio, Understanding the difficulty of training deep feedforward neural networks) is an efficient way to initialize the weights of neural networks.
For fully-connected layers those tutorials gave the rule of thumb:
$$Var(W) = \frac{2}{n_{in} + n_{out}}, \quad \text{simpler alternative:} \quad Var(W) = \frac{1}{n_{in}}$$
where $Var(W)$ is the variance of the weights for a layer, initialized with a normal distribution, and $n_{in}$, $n_{out}$ are the numbers of neurons in the parent layer and in the current layer, respectively.
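For concreteness, here is a minimal NumPy sketch of how I understand this rule for a fully-connected layer (the function name and the layer sizes are my own, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_dense(n_in, n_out):
    """Draw weights from N(0, 2 / (n_in + n_out)) -- the Xavier rule above."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

W = xavier_dense(256, 128)           # hypothetical layer sizes
print(W.var(), 2 / (256 + 128))      # sample variance vs. target variance
```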
Are there similar rules of thumb for convolutional layers?
I am struggling to figure out what would be best to initialize the weights of a convolutional layer. E.g. in a layer where the shape of the weights is (5, 5, 3, 8), the kernel size is 5x5, filtering three input channels (RGB input) and creating 8 feature maps: would 3 be considered the number of input neurons? Or rather 75 = 5*5*3, because the input consists of 5x5 patches for each color channel?
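To make the two readings concrete, here is a small sketch of both fan-in candidates I am considering, using the simpler rule $Var(W) = 1/n_{in}$ (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
kh, kw, c_in, c_out = 5, 5, 3, 8     # the weight shape (5, 5, 3, 8) from above

# Reading 1: n_in is just the number of input channels
fan_in_channels = c_in               # 3

# Reading 2: n_in is the full receptive field per output unit
fan_in_patch = kh * kw * c_in        # 75 = 5*5*3

# The two readings give very different weight scales:
W1 = rng.normal(0.0, np.sqrt(1.0 / fan_in_channels), size=(kh, kw, c_in, c_out))
W2 = rng.normal(0.0, np.sqrt(1.0 / fan_in_patch), size=(kh, kw, c_in, c_out))
print(W1.std(), W2.std())            # ~0.577 vs. ~0.115
```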
I would accept both a specific answer clarifying this problem and a more "generic" answer explaining the general process of finding the right weight initialization, preferably with linked sources.