Several tutorials I have read state that "Xavier" weight initialization (paper: Glorot & Bengio, Understanding the difficulty of training deep feedforward neural networks) is an efficient way to initialize the weights of neural networks.
For fully-connected layers those tutorials gave the rule of thumb:
$$Var(W) = \frac{2}{n_{in} + n_{out}}, \quad \text{simpler alternative:} \quad Var(W) = \frac{1}{n_{in}}$$
where $Var(W)$ is the variance of the weights for a layer, initialized with a normal distribution, and $n_{in}$, $n_{out}$ are the numbers of neurons in the parent layer and in the current layer, respectively.
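For concreteness, here is a minimal NumPy sketch of how I understand this rule for a fully-connected layer (the function name and the layer sizes are my own, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_dense(n_in, n_out):
    """Draw weights from N(0, 2 / (n_in + n_out)) -- the Xavier rule above."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

W = xavier_dense(256, 128)           # hypothetical layer sizes
print(W.var(), 2 / (256 + 128))      # sample variance vs. target variance
```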
Are there similar rules of thumb for convolutional layers?
I am struggling to figure out what would be best to initialize the weights of a convolutional layer. E.g. in a layer where the shape of the weights is (5, 5, 3, 8), the kernel size is 5x5, filtering three input channels (RGB input) and creating 8 feature maps: would 3 be considered the number of input neurons? Or rather 75 = 5*5*3, because the input consists of 5x5 patches for each color channel?
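To make the two readings concrete, here is a small sketch of both fan-in candidates I am considering, using the simpler rule $Var(W) = 1/n_{in}$ (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
kh, kw, c_in, c_out = 5, 5, 3, 8     # the weight shape (5, 5, 3, 8) from above

# Reading 1: n_in is just the number of input channels
fan_in_channels = c_in               # 3

# Reading 2: n_in is the full receptive field per output unit
fan_in_patch = kh * kw * c_in        # 75 = 5*5*3

# The two readings give very different weight scales:
W1 = rng.normal(0.0, np.sqrt(1.0 / fan_in_channels), size=(kh, kw, c_in, c_out))
W2 = rng.normal(0.0, np.sqrt(1.0 / fan_in_patch), size=(kh, kw, c_in, c_out))
print(W1.std(), W2.std())            # ~0.577 vs. ~0.115
```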
I would accept both a specific answer clarifying this problem and a more "generic" answer explaining the general process of finding the right weight initialization, preferably with linked sources.