I was reading *Optimization as a Model for Few-Shot Learning* and *Learning to Learn by Gradient Descent by Gradient Descent*, and I noticed that both papers use something they call coordinate-wise optimizers, apparently for efficiency reasons. Does that mean that their LSTM optimizers operate on 1D inputs, i.e. they receive and output only one coordinate at a time, and that they share parameters across coordinates too?
My main question/worry is this:
In the paper they say they share parameters across coordinates. In particular, what I am worried about is that the input and forget gates take $\theta^{<t-1>}$ as input, and thus, if there are many parameters, their weight matrices would be enormous. For example, recall the input and forget gate equations:
$$ i^{<t>} = \sigma( W_I [\tilde \nabla^{<t>}, \mathcal L^{<t>}, \theta^{<t-1>}, i^{<t-1>} ] + b_I )$$
$$ f^{<t>} = \sigma( W_F [\tilde \nabla^{<t>}, \mathcal L^{<t>}, \theta^{<t-1>}, f^{<t-1>} ] + b_F )$$
Does that mean that the input and forget gates actually only receive a single coordinate at a time? I.e., should the equations instead read (for coordinate $j$):
$$ i^{<t>}_j = \sigma( W_I [\tilde \nabla^{<t>}_j, \mathcal L^{<t>}, \theta^{<t-1>}_j, i^{<t-1>}_j ] + b_I )$$
$$ f^{<t>}_j = \sigma( W_F [\tilde \nabla^{<t>}_j, \mathcal L^{<t>}, \theta^{<t-1>}_j, f^{<t-1>}_j ] + b_F )$$
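To make my mental model concrete, here is a minimal sketch of what I *think* "coordinate-wise with shared parameters" means: one small LSTM whose weights are shared across every coordinate, applied to each coordinate independently by treating coordinates as batch elements (each with its own hidden/cell state). This is not code from either paper; the class name, the hidden size, and the fact that I feed only the gradient are my own simplifications.

```python
import torch
import torch.nn as nn

class CoordinatewiseOptimizer(nn.Module):
    """One small LSTM whose weights are shared across all coordinates.

    Each coordinate j of the optimizee keeps its own hidden/cell state,
    but the same weight matrices (the analogue of W_I, W_F above) are
    applied to every coordinate."""

    def __init__(self, hidden_size=20):
        super().__init__()
        # The input for a single coordinate is just its (preprocessed)
        # gradient, so input_size=1 no matter how many parameters the
        # optimizee has.
        self.lstm = nn.LSTMCell(input_size=1, hidden_size=hidden_size)
        self.output = nn.Linear(hidden_size, 1)  # scalar update per coordinate

    def forward(self, grad, state):
        # grad: (num_coords, 1) -- coordinates are treated as independent
        # "batch" elements, which is what parameter sharing amounts to here.
        h, c = self.lstm(grad, state)
        return self.output(h), (h, c)

# Usage: 10,000 optimizee parameters, yet the optimizer itself stays tiny.
num_coords, hidden = 10_000, 20
meta_opt = CoordinatewiseOptimizer(hidden)
state = (torch.zeros(num_coords, hidden), torch.zeros(num_coords, hidden))
grads = torch.randn(num_coords, 1)          # stand-in for the gradients
updates, state = meta_opt(grads, state)     # one scalar update per coordinate
```

If this is the right picture, then the size of $W_I$ and $W_F$ depends only on the tiny per-coordinate input and the hidden size, not on the number of optimizee parameters, which would resolve my "enormous matrices" worry.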
So when they write $\theta_t$ anywhere in the paper, is it actually just a single number $\theta^{<t>}_j$ (i.e. a single coordinate)? Is that correct?
cross-posted: