There are many papers on how Adagrad is used with SGD, but I have not seen any where it is applied to full-batch gradient descent.
I have a situation wherein batch gradient descent is faster than SGD (unique to my problem).
So far I am simply using an optimization package that does LBFGS optimization. This works OK, but LBFGS only does a line search for a single scalar learning rate. With Adagrad I could get a learning rate per dimension of my parameter vector, which seems better than one scalar learning rate.
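To clarify what I mean, here is a minimal sketch of Adagrad applied to the full-batch gradient rather than stochastic minibatch gradients (the function and variable names like `loss_grad` and `theta0` are just placeholders for my actual problem):

```python
import numpy as np

def adagrad_batch(loss_grad, theta0, lr=0.1, eps=1e-8, n_iters=100):
    """Adagrad updates where each iteration uses the full-batch gradient."""
    theta = theta0.copy()
    g_sq_sum = np.zeros_like(theta)  # per-dimension sum of squared gradients
    for _ in range(n_iters):
        g = loss_grad(theta)         # full-batch gradient, not a stochastic estimate
        g_sq_sum += g ** 2
        # per-dimension effective step size lr / sqrt(sum of squared gradients)
        theta -= lr * g / (np.sqrt(g_sq_sum) + eps)
    return theta

# toy example: quadratic loss 0.5 * ||A @ theta - b||^2, gradient A.T @ (A @ theta - b)
A = np.array([[3.0, 0.0], [0.0, 0.5]])
b = np.array([1.0, 1.0])
theta_hat = adagrad_batch(lambda t: A.T @ (A @ t - b), np.zeros(2), lr=0.5, n_iters=500)
```

The only difference from the usual SGD version is that `g` is the exact gradient over the whole dataset, so the per-dimension scaling reflects the true curvature-like statistics rather than noisy estimates.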
My question is: is there any reason NOT to use Adagrad in batch gradient descent?