I was reading *Optimization as a Model for Few-Shot Learning* and *Learning to Learn by Gradient Descent by Gradient Descent*, and I noticed that both papers use something they call coordinate-wise optimizers, apparently for efficiency reasons. Does that mean that their LSTM optimizers operate on 1D inputs, i.e. they receive and output only one coordinate at a time, and that they share parameters across coordinates too?
My main question/worry is this:
In the paper they say they share parameters across coordinates. In particular, what I am worried about is that the input and forget gates take $\theta^{<t-1>}$ as input, and thus, if there are many parameters, their weight matrices would be enormous. For example, recall the input and forget gate equations:
$$ i^{<t>} = \sigma( W_I [\tilde \nabla^{<t>}, \mathcal L^{<t>}, \theta^{<t-1>}, i^{<t-1>} ] + b_I )$$
$$ f^{<t>} = \sigma( W_F [\tilde \nabla^{<t>}, \mathcal L^{<t>}, \theta^{<t-1>}, f^{<t-1>} ] + b_F )$$
Does that mean that the input and forget gates actually only receive a single coordinate at a time? I.e., should the equations instead read (for coordinate $j$):
$$ i^{<t>}_j = \sigma( W_I [\tilde \nabla^{<t>}_j, \mathcal L^{<t>}, \theta^{<t-1>}_j, i^{<t-1>}_j ] + b_I )$$
$$ f^{<t>}_j = \sigma( W_F [\tilde \nabla^{<t>}_j, \mathcal L^{<t>}, \theta^{<t-1>}_j, f^{<t-1>}_j ] + b_F )$$
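To make my mental model concrete, here is a minimal sketch of what I *think* "coordinate-wise with shared parameters" means: one small LSTM whose weights are shared across every coordinate, applied to each coordinate independently by treating coordinates as batch elements (each with its own hidden/cell state). This is not code from either paper; the class name, the hidden size, and the fact that I feed only the gradient are my own simplifications.

```python
import torch
import torch.nn as nn

class CoordinatewiseOptimizer(nn.Module):
    """One small LSTM whose weights are shared across all coordinates.

    Each coordinate j of the optimizee keeps its own hidden/cell state,
    but the same weight matrices (the analogue of W_I, W_F above) are
    applied to every coordinate."""

    def __init__(self, hidden_size=20):
        super().__init__()
        # The input for a single coordinate is just its (preprocessed)
        # gradient, so input_size=1 no matter how many parameters the
        # optimizee has.
        self.lstm = nn.LSTMCell(input_size=1, hidden_size=hidden_size)
        self.output = nn.Linear(hidden_size, 1)  # scalar update per coordinate

    def forward(self, grad, state):
        # grad: (num_coords, 1) -- coordinates are treated as independent
        # "batch" elements, which is what parameter sharing amounts to here.
        h, c = self.lstm(grad, state)
        return self.output(h), (h, c)

# Usage: 10,000 optimizee parameters, yet the optimizer itself stays tiny.
num_coords, hidden = 10_000, 20
meta_opt = CoordinatewiseOptimizer(hidden)
state = (torch.zeros(num_coords, hidden), torch.zeros(num_coords, hidden))
grads = torch.randn(num_coords, 1)          # stand-in for the gradients
updates, state = meta_opt(grads, state)     # one scalar update per coordinate
```

If this is the right picture, then the size of $W_I$ and $W_F$ depends only on the tiny per-coordinate input and the hidden size, not on the number of optimizee parameters, which would resolve my "enormous matrices" worry.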
So when they write $\theta_t$ anywhere in the paper, is it actually just a single number $\theta^{<t>}_j$ (i.e. a single coordinate)? Is that correct?
cross-posted: