Questions tagged [adagrad]

AdaGrad (for adaptive gradient algorithm) is an enhanced stochastic gradient descent algorithm that automatically determines a per-parameter learning rate.
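
A minimal sketch of that per-parameter update, in NumPy (the names `eta`, `eps`, and `accum` are illustrative, not tied to any particular library):

```python
import numpy as np

def adagrad_step(theta, grad, accum, eta=0.01, eps=1e-8):
    """One AdaGrad update: each parameter gets its own effective step size,
    shrinking as its squared gradients accumulate."""
    accum = accum + grad ** 2                        # per-parameter sum of squared gradients
    theta = theta - eta / (np.sqrt(accum) + eps) * grad
    return theta, accum
```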

15 questions
5
votes
0 answers

Adagrad for batch gradient descent

There are many papers on how Adagrad is used in SGD, but I have not seen any where it is applied to batch gradient descent. I have a situation wherein batch gradient descent is faster than SGD (unique to my problem). So far I am simply using an optimization…
A.D • 2,114 • 3 • 17 • 27
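
Nothing in the AdaGrad update ties the accumulator to single-sample gradients, so one sketch of "AdaGrad for batch descent" is simply to feed it the full-batch gradient; `loss_grad`, `X`, and `y` below are hypothetical placeholders, not from the question.

```python
import numpy as np

def batch_adagrad(theta, loss_grad, X, y, eta=0.1, eps=1e-8, n_iter=100):
    """AdaGrad driven by full-batch gradients: loss_grad(theta, X, y) is
    assumed to return the gradient computed over the entire dataset."""
    accum = np.zeros_like(theta)
    for _ in range(n_iter):
        g = loss_grad(theta, X, y)           # one gradient per full pass over the data
        accum += g ** 2
        theta = theta - eta / (np.sqrt(accum) + eps) * g
    return theta
```
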
4
votes
1 answer

Should you use optimization algorithms like Adagrad and ADAM for neural network online training?

Optimization algorithms like Adagrad and ADAM decay your learning rate over time. To me this sounds like a bad idea for online training since you're always getting new data as opposed to retraining on the same data for multiple epochs in…
user3768533 • 141 • 1
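
For intuition about the decay the asker worries about: if each new sample keeps producing gradients of similar magnitude, AdaGrad's accumulated sum keeps growing and the effective step size shrinks roughly like 1/sqrt(t), which is the concern for a never-ending online stream. A purely illustrative sketch:

```python
import numpy as np

eta, eps = 0.1, 1e-8
accum = 0.0
for t in range(1, 6):
    g = 1.0                                  # pretend |gradient| stays around 1
    accum += g ** 2
    print(t, eta / (np.sqrt(accum) + eps))   # 0.100, 0.071, 0.058, 0.050, 0.045
```
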
2
votes
1 answer

How to perform adagrad stochastic gradient descent (SGD) on word2vec?

AdaGrad is an enhanced SGD that automatically determines a per-parameter learning rate. However, in word2vec, there's no clear "parameter" to perform adagrad on. So what's the closest algorithm to adagrad for word2vec?
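
One reading of the question: word2vec's parameters are the entries of its input and output embedding matrices, so a per-parameter accumulator can live alongside each matrix. A minimal sketch under that reading (`W_in`, `G_in`, and the shapes are illustrative):

```python
import numpy as np

vocab_size, dim = 10000, 100
W_in = np.random.randn(vocab_size, dim) * 0.01   # input (word) embedding matrix
G_in = np.zeros_like(W_in)                       # squared-gradient accumulator, same shape
eta, eps = 0.05, 1e-8

def adagrad_update_row(word_idx, grad_row):
    """Update one word vector using only its own accumulated statistics."""
    G_in[word_idx] += grad_row ** 2
    W_in[word_idx] -= eta / (np.sqrt(G_in[word_idx]) + eps) * grad_row
```
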
1
vote
0 answers

Momentum vs adaptive step methods

My understanding is that: with momentum, one can avoid e.g. "zig-zags" during gradient descent by averaging gradients to determine a better direction of descent. With adaptive step size methods like AdaGrad and RMSProp, one accumulates gradients and…
Josh • 3,408 • 4 • 22 • 46
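
A side-by-side sketch of the two update rules being contrasted (both functions are illustrative, stateless helpers):

```python
import numpy as np

def momentum_step(theta, grad, velocity, eta=0.01, beta=0.9):
    """Momentum: blend recent gradients into one direction, smoothing zig-zags."""
    velocity = beta * velocity + grad
    return theta - eta * velocity, velocity

def adagrad_step(theta, grad, accum, eta=0.01, eps=1e-8):
    """Adaptive step size: accumulate squared gradients and rescale each coordinate."""
    accum = accum + grad ** 2
    return theta - eta / (np.sqrt(accum) + eps) * grad, accum
```
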
1
vote
0 answers

Momentum vs Polyak averaging

I'm going through this deck but don't quite get the difference between momentum and Polyak averaging, and what role Polyak averaging plays in modern optimizers. For example, is it correct to say that in momentum one averages parameter gradients…
Josh • 3,408 • 4 • 22 • 46
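
A hedged sketch of the distinction: momentum folds an average of gradients into each update, while Polyak averaging leaves the optimizer alone and averages the parameter iterates it produces, reporting that average for evaluation.

```python
import numpy as np

def polyak_average(theta_avg, theta, t):
    """Running mean of the iterates theta_1, ..., theta_t produced by any optimizer;
    the averaged parameters are used for evaluation, not for the next update."""
    return theta_avg + (theta - theta_avg) / t

# Toy usage: average the iterates of plain gradient descent on f(x) = x^2.
theta, theta_avg = np.array([5.0]), np.array([5.0])
for t in range(1, 101):
    grad = 2 * theta                          # gradient of x^2
    theta = theta - 0.05 * grad               # optimizer step, unchanged by the averaging
    theta_avg = polyak_average(theta_avg, theta, t)
```
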
1
vote
1 answer

Same exact model: converges with adagrad, diverges with adadelta

I have a very simple model built using Keras. What strikes me as surprising is that the very same training config converges (i.e. training loss goes down with every epoch) when the model uses the adagrad optimizer, but it diverges when I use…
Felipe • 990 • 2 • 10 • 18
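
A minimal harness for the kind of comparison described, using tf.keras; the architecture, data, and learning rates below are placeholders, not the asker's configuration.

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Placeholder architecture standing in for the "very simple model".
    return tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

X, y = np.random.rand(256, 10), np.random.rand(256)
for make_opt in (lambda: tf.keras.optimizers.Adagrad(learning_rate=0.01),
                 lambda: tf.keras.optimizers.Adadelta(learning_rate=1.0)):
    model = build_model()                     # identical architecture, only the optimizer differs
    model.compile(optimizer=make_opt(), loss="mse")
    model.fit(X, y, epochs=5, verbose=0)
    print(model.evaluate(X, y, verbose=0))    # compare final training loss per optimizer
```
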
1
vote
0 answers

Behavior of AdaGrad without the square root in the denominator

Multiple articles claim that AdaGrad does not work well when the square root in the formula is omitted. This is one such example: $\theta_{t+1,i} = \theta_{t,i}-\dfrac{\eta}{\sqrt{G_{t,ii}+\epsilon}}\times g_{t,i}$. Here $G_{t,ii}$ represents the…
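
A small numeric sketch of the difference those articles point at: without the square root the denominator grows like the raw sum of squared gradients, so the step size collapses much faster (values purely illustrative).

```python
import numpy as np

eta, eps = 0.1, 1e-8
G = 0.0
for t in range(1, 6):
    g = 1.0                                   # constant-magnitude gradient for illustration
    G += g ** 2
    with_sqrt = eta / np.sqrt(G + eps) * g    # decays roughly like 1/sqrt(t)
    without_sqrt = eta / (G + eps) * g        # decays roughly like 1/t
    print(t, round(with_sqrt, 4), round(without_sqrt, 4))
```
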
1
vote
1 answer

Intuition behind learning rate scheduling in AdaDelta

To address the problems with AdaGrad, the learning rate is changed from $\frac{\eta}{\sqrt{G_{t, ii}+\epsilon}}$ to one that includes only the gradients within a small window of size $w$. But as the function approaches the optimal value, won't the denominator become…
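
A sketch of that windowed accumulation: AdaDelta/RMSProp replace AdaGrad's ever-growing sum with an exponentially decaying average, where the decay rate `rho` plays the role of the window size $w$, and the epsilon inside the square root keeps the denominator strictly positive even when recent gradients are tiny.

```python
import numpy as np

def rms_step(theta, grad, E_g2, eta=0.001, rho=0.9, eps=1e-8):
    """RMSProp-style update: E_g2 is a decaying average of squared gradients
    (an effective window of roughly 1/(1 - rho) recent steps) instead of the
    full AdaGrad sum; eps keeps the denominator away from zero."""
    E_g2 = rho * E_g2 + (1.0 - rho) * grad ** 2
    theta = theta - eta / np.sqrt(E_g2 + eps) * grad
    return theta, E_g2
```
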
1
vote
0 answers

Why does Adagrad improve the robustness of SGD?

I mainly read this blog, and this blog cites this paper for the statement that Adagrad improved the robustness of SGD. I have tried to check the original paper and other articles that explain why Adagrad improves the robustness compared with…
1
vote
1 answer

Adagrad Expression about Element-wise matrix vector multiplication

Sometimes, Adagrad is expressed like this: $\mathbf{x}^{t+1} = \mathbf{x}^t - \dfrac{\eta}{\sqrt{G^t + \epsilon}} \odot \nabla E$, where $G^t$ is a diagonal matrix. According to the wiki, the Hadamard product is only defined when the two matrices have the same shape. However, some libraries make those…
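
On the shape question: the diagonal-matrix form and the element-wise (Hadamard) form coincide, because multiplying by diag(v) is the same as multiplying element-wise by the vector v of its diagonal entries, which is why libraries can store G as a vector. A quick NumPy check (numbers made up):

```python
import numpy as np

v = np.array([1.0, 4.0, 9.0])            # diagonal entries of G^t (accumulated squared grads)
grad = np.array([0.5, -1.0, 2.0])        # stands in for the gradient of E
eta, eps = 0.1, 1e-8

via_matrix = np.diag(eta / np.sqrt(v + eps)) @ grad   # diagonal-matrix formulation
via_vector = eta / np.sqrt(v + eps) * grad            # element-wise formulation
assert np.allclose(via_matrix, via_vector)
```
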
1
vote
0 answers

Treating Categorical Variables as Continuous for Random Forest / Adaboost

What's the correct way to deal with categorical variables in packages like sklearn's RF and xgboost? Are there any cons to treating the variables as continuous? E.g. encode class A as 1, class B as 2, class C as 3?
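
A sketch of the two encodings being contrasted, using pandas (the column and class names are made up):

```python
import pandas as pd

df = pd.DataFrame({"color": ["A", "B", "C", "A"]})

# "Treat as continuous": integer codes impose an artificial order A < B < C.
ordinal = df["color"].map({"A": 1, "B": 2, "C": 3})

# One-hot: one indicator column per class, no implied ordering.
one_hot = pd.get_dummies(df["color"], prefix="color")
```
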
1
vote
0 answers

What happens if we use training data in reverse chronological order?

My SGD-Adagrad algorithm trains on chronologically ordered data to make future predictions. The test and validation data occurred after the training data. What happens if I use the training data in reverse chronological order? What I think should happen: Since…
Swapniel • 145 • 6
0
votes
1 answer

When do Adaptive Optimization Algorithms modify their parameters?

When do "Ada" optimizers (e.g. Adagrad, Adam, etc...) "adapt" their parameters? Is it at the end of each mini-batch or epoch?
0
votes
1 answer

xgboost: get error for each iteration

In XGBoost, is there a way to programmatically get the training and evaluation error per iteration of training? Will train until eval error hasn't decreased in 25 rounds. [0] train-rmspe:0.996873 eval-rmspe:0.996881 [1] train-rmspe:0.981762 …
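
A sketch of one way to capture those per-iteration numbers with the xgboost learning API: passing a dict through `evals_result` records, per round, the same metrics that are otherwise only printed (synthetic data; `rmse` stands in for the rmspe-style metric in the excerpt).

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(200, 5), np.random.rand(200)
dtrain = xgb.DMatrix(X[:150], label=y[:150])
deval = xgb.DMatrix(X[150:], label=y[150:])

results = {}                                  # filled in-place by xgb.train
booster = xgb.train(
    params={"objective": "reg:squarederror", "eval_metric": "rmse"},
    dtrain=dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train"), (deval, "eval")],
    early_stopping_rounds=25,
    evals_result=results,
    verbose_eval=False,
)
print(results["eval"]["rmse"])                # one entry per boosting round
```
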
0
votes
1 answer

Divergence in Stochastic Gradient Descent

I am using Stochastic Gradient Descent with ADAGRAD. I am training on a training set of 1.6 billion examples. After about 30 million examples, the training loss starts increasing after reaching a low. The examples in the training set are ordered…
Swapniel • 145 • 6