
Given a difficult learning task (e.g. high dimensionality, inherent data complexity), deep neural networks become hard to train. To ease many of these problems one might:

  1. Normalize and hand-pick quality data (see the sketch after this list)
  2. Choose a different training algorithm (e.g. RMSprop instead of plain gradient descent)
  3. Pick a cost function with steeper gradients (e.g. cross-entropy instead of MSE)
  4. Use a different network structure (e.g. convolutional layers instead of plain feedforward layers)
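
For the first point, a minimal NumPy sketch of standardizing the inputs to zero mean and unit variance (the data here is a made-up placeholder):

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Zero-mean / unit-variance scaling per feature; eps guards against
    constant (zero-variance) features."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)

X = np.random.randn(1000, 20) * 5.0 + 3.0  # toy data with arbitrary scale/offset
X_norm = standardize(X)                    # each column now has mean ~0, std ~1
```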

I have heard that there are clever ways to initialize the weights. For example, you can choose their magnitude more carefully, as in Glorot and Bengio (2010) (a minimal sketch follows the list):

  • for hyperbolic tangent units: sample from a Uniform(-r, r) with $r = \sqrt{\frac{6}{N_{in} + N_{out}}}$
  • for sigmoid units: sample from a Uniform(-r, r) with $r = 4 \sqrt{\frac{6}{N_{in} + N_{out}}}$
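
A minimal NumPy sketch of this scheme (the function name, `gain` argument, and layer sizes are illustrative, not part of the cited paper):

```python
import numpy as np

def glorot_uniform(n_in, n_out, gain=1.0, rng=np.random.default_rng(0)):
    """Sample an (n_in, n_out) weight matrix from Uniform(-r, r),
    with r = gain * sqrt(6 / (n_in + n_out))."""
    r = gain * np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-r, r, size=(n_in, n_out))

W_tanh = glorot_uniform(784, 256, gain=1.0)  # hyperbolic tangent units
W_sigm = glorot_uniform(784, 256, gain=4.0)  # sigmoid units (factor 4)
```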

Is there any consistent way of initializing the weights better?

mdewey
Joonatan Samuel
    Do not post [questions on multiple sites](https://datascience.stackexchange.com/questions/10926/how-to-deep-neural-network-weight-initialization) – LinkBerest Mar 31 '16 at 13:04

4 Answers


Recently, Batch Normalization was introduced for exactly this purpose; the paper is Ioffe and Szegedy (2015), "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".
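
As a quick illustration (not taken from the paper), a PyTorch sketch of adding BatchNorm layers to a small fully connected network; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

# Small fully connected net with Batch Normalization after the hidden linear layer.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),  # normalizes each feature over the mini-batch
    nn.ReLU(),
    nn.Linear(256, 10),
)

x = torch.randn(32, 784)  # dummy mini-batch of 32 examples
out = model(x)            # in training mode, BatchNorm uses batch statistics
```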

user52705
    I am using this already. Is it enough by itself or maybe it can be further improved? – Joonatan Samuel Mar 28 '16 at 14:57
  • There were some extensions, but I think this is the most popular one. I don't remember the exact names of the extensions. This should be enough; you can also use a higher learning rate while optimizing. – user52705 Mar 28 '16 at 15:15
  • I have seen adaptive weight opt. algorithms work a lot better. But thanks a lot! – Joonatan Samuel Mar 28 '16 at 15:21

The paper 'All you need is a good init' is a good, relatively recent article about initialization in deep learning (a rough sketch of its LSUV procedure is included below). What I liked about it is that:

  1. It has a short and effective literature survey of initialization methods, references included.
  2. It achieves very good results on CIFAR-10 without too many bells and whistles.
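
For illustration, here is a rough NumPy sketch of the paper's layer-sequential unit-variance (LSUV) idea as I understand it: orthonormal pre-initialization, then rescaling each layer so its pre-activation variance on a data batch is close to 1. The tanh nonlinearity and all names here are my assumptions, not the authors' code:

```python
import numpy as np

def lsuv_init(weights, x, tol=0.05, max_iter=10, seed=0):
    """Sketch of LSUV-style initialization for a list of (n_in, n_out)
    weight matrices, given a data batch x (rows = examples)."""
    rng = np.random.default_rng(seed)
    h = x
    for W in weights:
        # Orthonormal pre-initialization: slice a square orthogonal matrix.
        q, _ = np.linalg.qr(rng.standard_normal((max(W.shape), max(W.shape))))
        W[:] = q[:W.shape[0], :W.shape[1]]
        # Rescale W until the layer's pre-activation variance is ~1.
        for _ in range(max_iter):
            var = (h @ W).var()
            if abs(var - 1.0) < tol:
                break
            W /= np.sqrt(var)
        h = np.tanh(h @ W)  # assuming tanh units throughout
    return weights
```
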
rhadar

As far as I know, the two formulas you gave are pretty much the standard initialization. I did a literature review a while ago; please see my linked answer.

amoeba
Franck Dernoncourt

Weight initialization depends on the activation function being used. Glorot and Bengio (2010) derived a method for initializing weights based on the assumption that the activations are linear. Their method results in the formula: \begin{align} W \sim U \left[ -\frac{\sqrt 6}{\sqrt {n_{i} + n_{i+1}}}, \frac{\sqrt 6}{\sqrt {n_{i} + n_{i+1}}} \right] \end{align}

Here the weights are drawn from a uniform distribution, where $n_{i}$ is the fan-in and $n_{i+1}$ is the fan-out of the layer.

He et al. (2015) used a derivation that considers ReLUs as the activation function and obtained the weight initialization formula:

\begin{align} W_l \sim \mathcal N \left({\Large 0}, \sqrt{\frac{2}{n_l}} \right). \end{align}

Here the weights are drawn from a zero-mean Gaussian distribution whose standard deviation (std) is $\sqrt{\frac{2}{n_l}}$, where $n_l$ is the fan-in of layer $l$.
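
For reference, both schemes are available as built-in initializers in common frameworks; a PyTorch sketch (the layer size is just an example):

```python
import torch.nn as nn

layer = nn.Linear(784, 256)

# Glorot/Xavier uniform: U(-r, r) with r = sqrt(6 / (fan_in + fan_out))
nn.init.xavier_uniform_(layer.weight)

# He/Kaiming normal: zero-mean Gaussian with std = sqrt(2 / fan_in), for ReLU units
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)
```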

A more comprehensive series of articles covering the mathematics behind weight initialization can be found here.