I'm trying to understand the paper "Multiple Object Recognition with Visual Attention" (Ba et al., 2015), specifically Section 3, which explains how the model is trained.
Let me jump straight to the question without re-introducing the notation from the paper.
The authors state:
We can maximize likelihood of the class label by marginalizing over the glimpse locations: $$ \log\ p(y\ |\ I, W) = \log\sum_l p(l\ |\ I, W)\ p(y\ |\ l, I, W) $$
The marginalized objective function can be learned through optimizing its variational free energy lower bound $\mathcal{F}$: $$ \log\sum_l p(l\ |\ I, W)\ p(y\ |\ l, I, W) \geq \sum_l p(l\ |\ I, W)\ \log\ p(y\ |\ l, I, W) $$
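(As far as I can tell, this bound is just Jensen's inequality applied to the concave $\log$, with the expectation taken under the location distribution:)

$$ \log\sum_l p(l\ |\ I, W)\ p(y\ |\ l, I, W) = \log \mathbb{E}_{l \sim p(l\ |\ I, W)}\big[ p(y\ |\ l, I, W) \big] \geq \mathbb{E}_{l \sim p(l\ |\ I, W)}\big[ \log p(y\ |\ l, I, W) \big] = \mathcal{F} $$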
The learning rules can be derived by taking derivatives of the above free energy with respect to the model parameter $W$: $$ \frac{\partial \mathcal{F}}{\partial W} = \sum_l p(l\ |\ I, W) \left[ \frac{\partial \log p(y\ |\ l, I, W)}{\partial W} + \log p(y\ |\ l, I, W) \frac{\partial \log p(l\ |\ I, W)}{\partial W} \right] $$
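(Working through the product rule myself, the second term seems to come from the log-derivative identity $\frac{\partial p(l\ |\ I, W)}{\partial W} = p(l\ |\ I, W)\ \frac{\partial \log p(l\ |\ I, W)}{\partial W}$, which is also what makes this look like a score-function/REINFORCE estimator:)

$$ \frac{\partial \mathcal{F}}{\partial W} = \sum_l \left[ p(l\ |\ I, W)\ \frac{\partial \log p(y\ |\ l, I, W)}{\partial W} + \log p(y\ |\ l, I, W)\ \frac{\partial p(l\ |\ I, W)}{\partial W} \right] $$

Substituting the identity into the second term gives the paper's expression and keeps the whole gradient as an expectation under $p(l\ |\ I, W)$, so it can be sampled.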
This expectation can be approximated with $M$ Monte Carlo samples $l^m \sim p(l\ |\ I, W)$:
$$ \frac{\partial \mathcal{F}}{\partial W} \approx \frac{1}{M}\sum_{m=1}^M \left[ \frac{\partial \log p(y\ |\ l^m, I, W)}{\partial W} + \log p(y\ |\ l^m, I, W) \frac{\partial \log p(l^m\ |\ I, W)}{\partial W} \right] $$
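To check that I'm parsing the two terms correctly, here is a toy sketch of how I currently picture the sampled estimator. All names here (`W_loc`, `W_cls`, `sample_gradient_estimate`, the linear-softmax location policy and classifier) are my own simplifications, not the architecture from the paper; only the structure of the two gradient terms is meant to match the equation above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_gradient_estimate(W_loc, W_cls, feats, y, M=10, rng=None):
    """Monte Carlo estimate of dF/dW from M sampled glimpse locations l^m.

    Toy stand-ins: p(l | I, W) is a softmax over a small set of candidate
    locations, p(y | l, I, W) is a softmax classifier on a "glimpse" vector.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_locs, n_classes = W_loc.shape[0], W_cls.shape[0]

    grad_W_loc = np.zeros_like(W_loc)
    grad_W_cls = np.zeros_like(W_cls)

    # p(l | I, W): location distribution (toy linear-softmax policy)
    p_l = softmax(W_loc @ feats)

    for _ in range(M):
        l = rng.choice(n_locs, p=p_l)      # sample glimpse location l^m
        glimpse = feats * (l + 1)          # crude stand-in for "look at location l"

        # p(y | l^m, I, W): class distribution given that glimpse
        p_y = softmax(W_cls @ glimpse)
        log_p_y = np.log(p_y[y] + 1e-12)

        # Term 1: ordinary gradient of the classifier log-likelihood,
        #         d log p(y | l^m, I, W) / d W_cls
        grad_W_cls += np.outer(np.eye(n_classes)[y] - p_y, glimpse)

        # Term 2: score-function (REINFORCE-like) term,
        #         log p(y | l^m, I, W) * d log p(l^m | I, W) / d W_loc.
        # The class log-likelihood acts as a reward weighting the policy
        # gradient: locations that led to a confident correct prediction
        # get made more probable.
        grad_W_loc += log_p_y * np.outer(np.eye(n_locs)[l] - p_l, feats)

    # Average over the M samples (ascent direction on the lower bound F)
    return grad_W_loc / M, grad_W_cls / M
```

(In this picture the first term is ordinary backprop through the classifier for the sampled glimpse, and the second term is a policy-gradient-style update on the location distribution, weighted by how well that glimpse explained the label. I assume a real implementation would also subtract a baseline from $\log p(y\ |\ l^m, I, W)$ to reduce variance, as is usual with REINFORCE, but I left that out of the sketch.)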
I don't really see how this learning rule allows the network to learn where to attend. The way I think about it now is: whenever the network happens to attend to a relevant piece of the image, $\log p(y\ |\ l^m, I, W)$ is relatively large, so the update pushes the parameters toward increasing $\log p(l^m\ |\ I, W)$, i.e. toward making that location more probable in the future. But I don't think this is correct.
Could someone provide an intuitive explanation of what is going on here, or at least point me to relevant literature?