I'm trying to understand the paper "Multiple Object Recognition with Visual Attention" (Ba et al., 2015), specifically Section 3, which explains how the model is trained.
Let me jump straight to the question without re-introducing the notation from the paper.
The authors state:
We can maximize likelihood of the class label by marginalizing over the glimpse locations: $$ \log\ p(y\ |\ I, W) = \log\sum_l p(l\ |\ I, W)\ p(y\ |\ l, I, W) $$
The marginalized objective function can be learned through optimizing its variational free energy lower bound $\mathcal{F}$: $$ \log\sum_l p(l\ |\ I, W)\ p(y\ |\ l, I, W) \geq \sum_l p(l\ |\ I, W)\ \log\ p(y\ |\ l, I, W) $$
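(As far as I can tell, this bound is just Jensen's inequality applied to the concave $\log$, with the expectation taken under the location distribution:)

$$ \log\sum_l p(l\ |\ I, W)\ p(y\ |\ l, I, W) = \log \mathbb{E}_{l \sim p(l\ |\ I, W)}\big[ p(y\ |\ l, I, W) \big] \geq \mathbb{E}_{l \sim p(l\ |\ I, W)}\big[ \log p(y\ |\ l, I, W) \big] = \mathcal{F} $$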
The learning rules can be derived by taking derivatives of the above free energy with respect to the model parameter $W$: $$ \frac{\partial \mathcal{F}}{\partial W} = \sum_l p(l\ |\ I, W) \left[ \frac{\partial \log p(y\ |\ l, I, W)}{\partial W} + \log p(y\ |\ l, I, W) \frac{\partial \log p(l\ |\ I, W)}{\partial W} \right] $$
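(Working through the product rule myself, the second term seems to come from the log-derivative identity $\frac{\partial p(l\ |\ I, W)}{\partial W} = p(l\ |\ I, W)\ \frac{\partial \log p(l\ |\ I, W)}{\partial W}$, which is also what makes this look like a score-function/REINFORCE estimator:)

$$ \frac{\partial \mathcal{F}}{\partial W} = \sum_l \left[ p(l\ |\ I, W)\ \frac{\partial \log p(y\ |\ l, I, W)}{\partial W} + \log p(y\ |\ l, I, W)\ \frac{\partial p(l\ |\ I, W)}{\partial W} \right] $$

Substituting the identity into the second term gives the paper's expression and keeps the whole gradient as an expectation under $p(l\ |\ I, W)$, so it can be sampled.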
This expectation can be approximated with $M$ Monte Carlo samples $l^m \sim p(l\ |\ I, W)$:
$$ \frac{\partial \mathcal{F}}{\partial W} \approx \frac{1}{M}\sum_{m=1}^M \left[ \frac{\partial \log p(y\ |\ l^m, I, W)}{\partial W} + \log p(y\ |\ l^m, I, W) \frac{\partial \log p(l^m\ |\ I, W)}{\partial W} \right] $$
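To check that I'm parsing the two terms correctly, here is a toy sketch of how I currently picture the sampled estimator. All names here (`W_loc`, `W_cls`, `sample_gradient_estimate`, the linear-softmax location policy and classifier) are my own simplifications, not the architecture from the paper; only the structure of the two gradient terms is meant to match the equation above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_gradient_estimate(W_loc, W_cls, feats, y, M=10, rng=None):
    """Monte Carlo estimate of dF/dW from M sampled glimpse locations l^m.

    Toy stand-ins: p(l | I, W) is a softmax over a small set of candidate
    locations, p(y | l, I, W) is a softmax classifier on a "glimpse" vector.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_locs, n_classes = W_loc.shape[0], W_cls.shape[0]

    grad_W_loc = np.zeros_like(W_loc)
    grad_W_cls = np.zeros_like(W_cls)

    # p(l | I, W): location distribution (toy linear-softmax policy)
    p_l = softmax(W_loc @ feats)

    for _ in range(M):
        l = rng.choice(n_locs, p=p_l)      # sample glimpse location l^m
        glimpse = feats * (l + 1)          # crude stand-in for "look at location l"

        # p(y | l^m, I, W): class distribution given that glimpse
        p_y = softmax(W_cls @ glimpse)
        log_p_y = np.log(p_y[y] + 1e-12)

        # Term 1: ordinary gradient of the classifier log-likelihood,
        #         d log p(y | l^m, I, W) / d W_cls
        grad_W_cls += np.outer(np.eye(n_classes)[y] - p_y, glimpse)

        # Term 2: score-function (REINFORCE-like) term,
        #         log p(y | l^m, I, W) * d log p(l^m | I, W) / d W_loc.
        # The class log-likelihood acts as a reward weighting the policy
        # gradient: locations that led to a confident correct prediction
        # get made more probable.
        grad_W_loc += log_p_y * np.outer(np.eye(n_locs)[l] - p_l, feats)

    # Average over the M samples (ascent direction on the lower bound F)
    return grad_W_loc / M, grad_W_cls / M
```

(In this picture the first term is ordinary backprop through the classifier for the sampled glimpse, and the second term is a policy-gradient-style update on the location distribution, weighted by how well that glimpse explained the label. I assume a real implementation would also subtract a baseline from $\log p(y\ |\ l^m, I, W)$ to reduce variance, as is usual with REINFORCE, but I left that out of the sketch.)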
I don't really see how this learning rule allows the network to learn where to attend. The way I think about it now is: whenever the network happens to attend to a relevant piece of the image, $\log p(y\ |\ l^m, I, W)$ is relatively large, so the update pushes the parameters toward increasing $\log p(l^m\ |\ I, W)$, i.e. toward making that location more probable in the future. But I don't think this is correct.
Could someone provide an intuitive explanation of what is going on here, or at least point me to relevant literature?