When applying dropout in artificial neural networks, one needs to compensate at test time for the fact that a portion of the neurons were deactivated during training. There are two common strategies for doing this:
- scaling the activation at test time
- inverting the dropout during the training phase
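Both strategies target the same invariant: the expected activation seen by the next layer should be the same at training and test time. A quick sketch of the argument (assuming $p$ denotes the keep probability, as in the slides): if a unit with activation $a$ is kept with probability $p$, then during training
$$\mathbb{E}[\text{mask}\cdot a] = p\,a,$$
so one can either multiply the (undropped) activation by $p$ at test time, or divide by $p$ already during training so that no correction is needed at test time.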
The two strategies are summarized in the slides below, taken from Stanford CS231n: Convolutional Neural Networks for Visual Recognition.
Which strategy is preferable, and why?
[Slide: Scaling the activation at test time]
[Slide: Inverting the dropout during the training phase]
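For concreteness, here is a minimal NumPy sketch of the two strategies. The function names, the single ReLU layer, and the use of $p$ as the keep probability are my own illustrative assumptions, not code from the slides:

```python
import numpy as np

p = 0.5  # keep probability (assumed convention; some frameworks use p as the drop probability)

def relu_layer(x, w):
    return np.maximum(0, x @ w)

# --- Strategy 1: plain dropout, scale the activations at test time ---
def forward_train_plain(x, w):
    a = relu_layer(x, w)
    mask = np.random.rand(*a.shape) < p   # keep each unit with probability p
    return a * mask                        # no rescaling during training

def forward_test_plain(x, w):
    return relu_layer(x, w) * p            # compensate by scaling at test time

# --- Strategy 2: inverted dropout, rescale during training ---
def forward_train_inverted(x, w):
    a = relu_layer(x, w)
    mask = (np.random.rand(*a.shape) < p) / p   # drop and rescale in one step
    return a * mask

def forward_test_inverted(x, w):
    return relu_layer(x, w)                 # test-time forward pass is untouched
```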
$0.2 \to 5 \to 1.25$
$0.5 \to 2 \to 2$
$0.8 \to 1.25 \to 5$
– Ken Chan Apr 25 '17 at 05:41
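The numbers in the comment above appear to list, for each value of $p$, the factors $1/p$ and $1/(1-p)$, i.e. the compensating factor depending on whether $p$ denotes the keep or the drop probability. For example:
$$0.2 \;\to\; \frac{1}{0.2} = 5 \;\to\; \frac{1}{1-0.2} = 1.25 .$$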