My CNN architecture uses pre-activation, i.e. BatchNorm -> ReLU -> Conv. Which weight initialization should I use for the convolutions? I'm under the impression that the standard ReLU initialization scheme, HeNormal, is designed for Conv -> ReLU -> BatchNorm, but I am unsure how this transfers to the pre-activation setting. If I were using ReLU -> BatchNorm -> Conv, it would be equivalent to Conv -> ReLU -> BatchNorm, as the order of layers is just a cyclic permutation. However, since I am using the pre-activation order found in DenseNets and pre-activation ResNets, I am unsure how the intermediate BatchNorm affects the mean and variance of the inputs to each unit.
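For concreteness, here is roughly what one of my pre-activation blocks looks like in Keras (the filter count and kernel size are just placeholders, and he_normal is only the candidate initializer I am asking about, glorot_uniform being the Keras default):

```python
from tensorflow.keras import layers

def preact_conv_block(x, filters, kernel_size=3):
    """A single pre-activation unit: BatchNorm -> ReLU -> Conv."""
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    # he_normal is the initializer in question; Keras defaults to glorot_uniform
    x = layers.Conv2D(filters, kernel_size, padding="same",
                      kernel_initializer="he_normal")(x)
    return x
```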
I looked at the code examples in the Keras applications library, and they all seem to use the default glorot_uniform initialization for pre-activation networks like DenseNet. However, the default initialization there tells me little, since the weights are loaded from pre-training on ImageNet anyway, so I am still unsure which scheme should be used when training from scratch.
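To get a feel for how much the two schemes actually differ in scale, here is a quick back-of-the-envelope comparison (the 3x3 kernel with 64 input and 64 output channels is just a made-up example shape):

```python
import numpy as np

# Hypothetical 3x3 conv with 64 input and 64 output channels
k, c_in, c_out = 3, 64, 64
fan_in = k * k * c_in    # 576
fan_out = k * k * c_out  # 576

# He normal: truncated normal with std = sqrt(2 / fan_in), derived for Conv -> ReLU
he_std = np.sqrt(2.0 / fan_in)

# Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)),
# whose standard deviation is limit / sqrt(3)
glorot_limit = np.sqrt(6.0 / (fan_in + fan_out))
glorot_std = glorot_limit / np.sqrt(3.0)

print(f"He normal std:      {he_std:.4f}")      # ~0.059
print(f"Glorot uniform std: {glorot_std:.4f}")  # ~0.042
```

So He starts the weights roughly 40% larger in standard deviation than Glorot for this shape; my question is essentially whether the BatchNorm placed before the convolution changes which of these variances is appropriate.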
Additionally, as a secondary question, I am also wondering which weight initialization I should use for depthwise separable convolutions under the same pre-activation scenario (i.e. BatchNorm -> ReLU -> N*Conv2D-3x3 -> Conv2D-1x1), since the separable convolution consists of two convolutions (depthwise and pointwise) with different fan-in/fan-out values.
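For reference, here is a minimal sketch of the separable block I have in mind, using Keras's SeparableConv2D, which exposes separate initializers for the depthwise and pointwise kernels (he_normal for both is just one candidate, not a settled choice):

```python
from tensorflow.keras import layers

def preact_sepconv_block(x, filters, kernel_size=3):
    """Pre-activation depthwise separable unit: BatchNorm -> ReLU -> SeparableConv."""
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    # The depthwise (3x3 per-channel) and pointwise (1x1) kernels have
    # different fan-in/fan-out and can be initialized independently
    x = layers.SeparableConv2D(filters, kernel_size, padding="same",
                               depthwise_initializer="he_normal",
                               pointwise_initializer="he_normal")(x)
    return x
```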