
Intuitively, dropout makes sense to me, but I don't understand how backpropagation works in the presence of dropout.

It looks like at each training step we backpropagate gradients to the parameters in the thinned network and ignore the dropped paths, as if there were nothing stochastic in the network. Why is this valid? Shouldn't we apply something like the reparameterization trick or Gumbel-softmax to backpropagate through dropout layers? (See the sketch below for what I mean by "ignoring the dropped paths".)
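Here is a minimal NumPy sketch (my own illustration, not code from any particular library) of how dropout is commonly implemented, using inverted dropout: the Bernoulli mask is sampled once in the forward pass and then treated as a fixed constant in the backward pass, so gradients of dropped units are simply zeroed. The function names `dropout_forward` / `dropout_backward` are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p):
    """Inverted dropout: keep each unit with probability 1 - p,
    scaling survivors by 1/(1 - p) so the expected output equals x."""
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask, mask

def dropout_backward(grad_out, mask):
    """The sampled mask is treated as a constant, so the gradient
    w.r.t. the input is just the incoming gradient times the mask:
    dropped units receive zero gradient."""
    return grad_out * mask

x = rng.standard_normal((4, 3))
out, mask = dropout_forward(x, p=0.5)
grad_in = dropout_backward(np.ones_like(out), mask)  # zeros where units were dropped
```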

mcuk
  • From the post above: "In general, it's important to account for anything that you're doing in the forward step in the backward step as well – otherwise you're computing a gradient of a different function than you're evaluating." But are we doing that here? Is the gradient of a Bernoulli even defined? – mcuk Apr 23 '18 at 16:01
  • The gradient is w.r.t. the probability $p$, not the outcome $Y$, so yes. – AdamO Apr 23 '18 at 16:38
  • Mmm, it still hasn't clicked for me. $p$ is not learnable, so is there a point to the gradient w.r.t. $p$? I assumed we need the gradient of dropout w.r.t. its input so that we can pass the incoming gradient along, and here it seems to be either 1 or 0 depending on the sampled dropout value. – mcuk Apr 23 '18 at 17:29

0 Answers