
I have a question related to the dropout function in the LSTM tutorial: http://deeplearning.net/tutorial/code/lstm.py

def dropout_layer(state_before, use_noise, trng):
    proj = tensor.switch(use_noise,
                         (state_before *
                          trng.binomial(state_before.shape,
                                        p=0.5, n=1,
                                        dtype=state_before.dtype)),
                         state_before * 0.5)
    return proj

To my understanding, the code means that when use_noise=1, we multiply state_before by a random binary vector (i.e. the dropout procedure).
But when use_noise=0, which is the case when we validate the model, we set the hidden unit values to state_before * 0.5.

Why *0.5 here?
Shouldn't it be just state_before without multiplying by any number?


1 Answer

If p=0.5 dropout is used, only half of the neurons are active during training. If we then activate all of them at test time, the output of the dropout layer would roughly get "doubled", so it makes sense to multiply the output by a factor of 1 - p to neutralize that effect.

Here's a quote from the dropout paper (http://arxiv.org/pdf/1207.0580v1.pdf):

At test time, we use the “mean network” that contains all of the hidden units but with their outgoing weights halved to compensate for the fact that twice as many of them are active.
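
The factor of 0.5 is exactly the keep probability, so the test-time output matches the expected value of the masked activations seen during training. Here is a minimal NumPy sketch (my own illustration, not part of lstm.py; the array size and sample count are arbitrary) checking that numerically:

import numpy as np

rng = np.random.RandomState(0)
state_before = rng.rand(5)      # hypothetical hidden-unit activations
keep_prob = 0.5                 # probability that a unit is kept (p in the code)

# Training: each unit is kept with probability keep_prob; average over many masks.
masks = rng.binomial(n=1, p=keep_prob, size=(100000, 5))
train_mean = (masks * state_before).mean(axis=0)

# Test: keep every unit but scale by keep_prob, as in the tensor.switch branch.
test_output = state_before * keep_prob

print(train_mean)    # roughly equal to test_output
print(test_output)

With enough samples, train_mean closely matches test_output, which is why leaving out the 0.5 at test time would systematically inflate the activations.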

Also see this question about the two different ways of implementing dropout: Dropout: scaling the activation versus inverting the dropout.
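
For completeness, here is a hypothetical NumPy sketch of that alternative ("inverted" dropout), where the rescaling by 1/keep_prob happens at training time so the test-time branch is just the identity. This is my own illustration, not code from the tutorial or the linked question:

import numpy as np

def inverted_dropout(x, keep_prob, rng, train=True):
    if train:
        mask = rng.binomial(n=1, p=keep_prob, size=x.shape)
        return x * mask / keep_prob   # expected value stays equal to x
    return x                          # test time: identity, no extra scaling

rng = np.random.RandomState(0)
x = rng.rand(5)
print(inverted_dropout(x, keep_prob=0.5, rng=rng, train=True))
print(inverted_dropout(x, keep_prob=0.5, rng=rng, train=False))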
