
In Goodfellow's Deep Learning book (http://www.deeplearningbook.org/contents/regularization.html, section 7.12), they state:

Because we usually use an inclusion probability of 1/2, the weight scaling rule usually amounts to dividing the weights by 2 at the end of training, and then using the model as usual. Another way to achieve the same result is to multiply the states of the units by 2 during training.

Could someone explain the purpose of rescaling when using dropout? I am having trouble grasping what exactly this is correcting for.

Nitro
  • I think a much better answer is here (https://stats.stackexchange.com/questions/192482/what-is-the-purpose-of-the-scaling-factor-used-in-dropout), where, along with the explanation, there is a link to the original paper discussing the "mean network". – f3n1Xx Jul 12 '21 at 22:55

1 Answer


Consider dropout with keep (inclusion) probability $p$, where $p \in (0, 1]$.

During training, the expected value of an output feature is $p \cdot \mathbb{E}[W^{\top}x]$, because each unit is kept only with probability $p$. Suppose the learnt rule is: if the feature $\geq 4$, predict class A, otherwise class B. At test time there is no dropout, so for the same input the expected activation is $\mathbb{E}[W^{\top}x]$, which is $1/p$ times larger, since all units are used. To prevent the decision boundary from shifting, you rescale so that the expected activation at the final layer stays the same: multiply the weights by $p$ at test time (for $p = 1/2$ this is the "dividing the weights by 2" in the quote), or equivalently multiply the kept activations by $1/p$ during training.
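A minimal numpy sketch of this argument (the vector sizes and random inputs are just illustrative assumptions): averaging one unit's activation over many dropout masks gives roughly the same value as a single pass with the weights multiplied by $p$, while the unscaled activation is about $1/p$ times larger.

    import numpy as np

    rng = np.random.default_rng(0)

    p = 0.5                          # keep (inclusion) probability
    x = rng.normal(size=1000)        # an input vector (illustrative)
    w = rng.normal(size=1000)        # weights of one output unit (illustrative)

    # Training-time view: average the unit's activation over many dropout masks.
    n_masks = 10000
    dropped = [(w * (rng.random(1000) < p)) @ x for _ in range(n_masks)]
    print("mean activation under dropout :", np.mean(dropped))

    # Test-time weight scaling rule: keep all units, multiply the weights by p.
    print("activation with weights * p   :", (p * w) @ x)

    # With no rescaling the activation is ~1/p times larger than training saw.
    print("activation with no rescaling  :", w @ x)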

In short, at test time you are taking a weighted average (and not a sum) over the exponentially many thinned networks learnt with dropout.
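For what it is worth, modern frameworks implement the second option from the quote ("multiply the states of the units by 2 during training"), usually called inverted dropout, so nothing needs to be rescaled at test time. A small PyTorch sketch (illustrative only; note that PyTorch's p argument is the drop probability, not the keep probability):

    import torch

    drop = torch.nn.Dropout(p=0.5)   # p here is the probability of *dropping* a unit

    x = torch.ones(8)

    drop.train()                     # training mode: drop units and scale survivors
    print(drop(x))                   # kept entries are 1 / (1 - p) = 2, dropped are 0

    drop.eval()                      # evaluation mode: identity, no rescaling needed
    print(drop(x))                   # all ones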

Amitoz Dandiana
  • It seems to me that if dropout is used only in the last layer (as is usually the case for convnets with a fully-connected layer at the end), then the scaling is not needed, because it will not affect choosing the largest output of the model (for a classification task). – MichaelSB May 05 '19 at 00:37
  • @MichaelSB not if you don't use softmax – crypdick Jul 11 '20 at 01:24
  • @crypdick why would it matter if we use softmax or not? I still don't see any need for scaling if we only use dropout in the last layer. – MichaelSB Jul 11 '20 at 10:49