Intuitively, dropout makes sense to me, but I don't understand how backpropagation works in the presence of dropout.
It looks like at each training step we backpropagate gradients only to the parameters of the thinned network and ignore the dropped paths, as if there were nothing stochastic in the network. Why is this valid? Shouldn't we need something like the reparameterization trick or Gumbel-Softmax approaches to backpropagate through dropout layers?
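
For concreteness, here is a minimal NumPy sketch of what I understand the mechanics to be (the function names and the inverted-dropout scaling convention are my own assumptions, not taken from any particular framework):

```python
# Minimal sketch of an (inverted) dropout layer's forward/backward pass,
# as I understand it. Names and scaling are my own assumptions.
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p_drop=0.5):
    # Sample a Bernoulli keep-mask once per training step and rescale by
    # 1/(1 - p_drop) so the expected activation matches test time.
    mask = (rng.random(x.shape) >= p_drop) / (1.0 - p_drop)
    return x * mask, mask

def dropout_backward(grad_out, mask):
    # The *same* mask is reused: gradients for dropped units are zeroed,
    # i.e. the sampled mask is treated as a constant during backprop.
    return grad_out * mask

x = rng.normal(size=(4, 3))
out, mask = dropout_forward(x, p_drop=0.5)
grad_x = dropout_backward(np.ones_like(out), mask)
print(grad_x)  # zeros where units were dropped, 1/(1-p) elsewhere
```

So the sampled mask is simply held fixed and the gradient is taken through the deterministic multiplication, with no special treatment of the sampling step itself. Is that really all there is to it?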