
I have a question related to the dropout function in the LSTM tutorial: http://deeplearning.net/tutorial/code/lstm.py

def dropout_layer(state_before, use_noise, trng):
    proj = tensor.switch(use_noise,
                         (state_before *
                          trng.binomial(state_before.shape,
                                        p=0.5, n=1,
                                        dtype=state_before.dtype)),
                         state_before * 0.5)
    return proj

To my understanding, the code means that when use_noise=1, we multiply state_before by a random binary vector (i.e. the dropout procedure).
But when use_noise=0, which is the case when we validate the model, we set the hidden unit values to state_before * 0.5.

Why *0.5 here?
Shouldn't it be just state_before without multiplying by any number?


1 Answer

If p=0.5 dropout is used, only half of the neurons are active during training. If we then activate all of them at test time, the output of the dropout layer would roughly get "doubled", so it makes sense to multiply the output by a factor of 1 - p to neutralize that effect.

Here's a quote from the dropout paper (http://arxiv.org/pdf/1207.0580v1.pdf):

At test time, we use the “mean network” that contains all of the hidden units but with their outgoing weights halved to compensate for the fact that twice as many of them are active.
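
The factor of 0.5 is exactly the keep probability, so the test-time output matches the expected value of the masked activations seen during training. Here is a minimal NumPy sketch (my own illustration, not part of lstm.py; the array size and sample count are arbitrary) checking that numerically:

import numpy as np

rng = np.random.RandomState(0)
state_before = rng.rand(5)      # hypothetical hidden-unit activations
keep_prob = 0.5                 # probability that a unit is kept (p in the code)

# Training: each unit is kept with probability keep_prob; average over many masks.
masks = rng.binomial(n=1, p=keep_prob, size=(100000, 5))
train_mean = (masks * state_before).mean(axis=0)

# Test: keep every unit but scale by keep_prob, as in the tensor.switch branch.
test_output = state_before * keep_prob

print(train_mean)    # roughly equal to test_output
print(test_output)

With enough samples, train_mean closely matches test_output, which is why leaving out the 0.5 at test time would systematically inflate the activations.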

Also see this question about the two different ways of implementing dropout: Dropout: scaling the activation versus inverting the dropout.
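
For completeness, here is a hypothetical NumPy sketch of that alternative ("inverted" dropout), where the rescaling by 1/keep_prob happens at training time so the test-time branch is just the identity. This is my own illustration, not code from the tutorial or the linked question:

import numpy as np

def inverted_dropout(x, keep_prob, rng, train=True):
    if train:
        mask = rng.binomial(n=1, p=keep_prob, size=x.shape)
        return x * mask / keep_prob   # expected value stays equal to x
    return x                          # test time: identity, no extra scaling

rng = np.random.RandomState(0)
x = rng.rand(5)
print(inverted_dropout(x, keep_prob=0.5, rng=rng, train=True))
print(inverted_dropout(x, keep_prob=0.5, rng=rng, train=False))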
