
In Goodfellow's Deep Learning book (http://www.deeplearningbook.org/contents/regularization.html, section 7.12), they state:

Because we usually use an inclusion probability of 1/2, the weight scaling rule usually amounts to dividing the weights by 2 at the end of training, and then using the model as usual. Another way to achieve the same result is to multiply the states of the units by 2 during training.

Could someone explain the purpose of rescaling when using dropout? I am having trouble grasping what exactly this is correcting for.

Nitro
  • I think a much better answer is here (https://stats.stackexchange.com/questions/192482/what-is-the-purpose-of-the-scaling-factor-used-in-dropout), where, along with the explanation, there is a link to the original paper discussing the "mean network". – f3n1Xx Jul 12 '21 at 22:55

1 Answer


Consider dropout with keep (inclusion) probability $p$, where $p \in (0, 1]$.

During training, the expected value of an output feature is $p \cdot \mathbb{E}[W^{\top}x]$, because each unit is kept only with probability $p$. Suppose the learnt rule is: if the feature $\geq 4$, predict class A, otherwise class B. At test time there is no dropout, so for the same input the expected activation is $\mathbb{E}[W^{\top}x]$, which is $1/p$ times larger, since all units are used. To prevent the decision boundary from shifting, you rescale so that the expected activation at the final layer stays the same: multiply the weights by $p$ at test time (for $p = 1/2$ this is the "dividing the weights by 2" in the quote), or equivalently multiply the kept activations by $1/p$ during training.
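A minimal numpy sketch of this argument (the vector sizes and random inputs are just illustrative assumptions): averaging one unit's activation over many dropout masks gives roughly the same value as a single pass with the weights multiplied by $p$, while the unscaled activation is about $1/p$ times larger.

    import numpy as np

    rng = np.random.default_rng(0)

    p = 0.5                          # keep (inclusion) probability
    x = rng.normal(size=1000)        # an input vector (illustrative)
    w = rng.normal(size=1000)        # weights of one output unit (illustrative)

    # Training-time view: average the unit's activation over many dropout masks.
    n_masks = 10000
    dropped = [(w * (rng.random(1000) < p)) @ x for _ in range(n_masks)]
    print("mean activation under dropout :", np.mean(dropped))

    # Test-time weight scaling rule: keep all units, multiply the weights by p.
    print("activation with weights * p   :", (p * w) @ x)

    # With no rescaling the activation is ~1/p times larger than training saw.
    print("activation with no rescaling  :", w @ x)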

In short, at test time you are taking a weighted average (and not a sum) over the exponentially many thinned networks learnt with dropout.
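For what it is worth, modern frameworks implement the second option from the quote ("multiply the states of the units by 2 during training"), usually called inverted dropout, so nothing needs to be rescaled at test time. A small PyTorch sketch (illustrative only; note that PyTorch's p argument is the drop probability, not the keep probability):

    import torch

    drop = torch.nn.Dropout(p=0.5)   # p here is the probability of *dropping* a unit

    x = torch.ones(8)

    drop.train()                     # training mode: drop units and scale survivors
    print(drop(x))                   # kept entries are 1 / (1 - p) = 2, dropped are 0

    drop.eval()                      # evaluation mode: identity, no rescaling needed
    print(drop(x))                   # all ones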

Amitoz Dandiana
  • It seems to me that if dropout is used only in the last layer (as is usually the case for convnets with a fully-connected layer at the end), then the scaling is not needed, because it will not affect choosing the largest output of the model (for a classification task). – MichaelSB May 05 '19 at 00:37
  • @MichaelSB not if you don't use softmax – crypdick Jul 11 '20 at 01:24
  • @crypdick why would it matter if we use softmax or not? I still don't see any need for scaling if we only use dropout in the last layer. – MichaelSB Jul 11 '20 at 10:49