Questions tagged [adagrad]

AdaGrad (for adaptive gradient algorithm) is an enhanced stochastic gradient descent algorithm that automatically determines a per-parameter learning rate.
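
A minimal sketch of that per-parameter update, in NumPy (the names `eta`, `eps`, and `accum` are illustrative, not tied to any particular library):

```python
import numpy as np

def adagrad_step(theta, grad, accum, eta=0.01, eps=1e-8):
    """One AdaGrad update: each parameter gets its own effective step size,
    shrinking as its squared gradients accumulate."""
    accum = accum + grad ** 2                        # per-parameter sum of squared gradients
    theta = theta - eta / (np.sqrt(accum) + eps) * grad
    return theta, accum
```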

15 questions
5
votes
0 answers

Adagrad for batch gradient descent

There are many papers on how Adagrad is used in SGD, but I have not seen any where it is applied to batch gradient descent. I have a situation wherein batch gradient descent is faster than SGD (unique to my problem). So far I am simply using an optimization…
A.D • 2,114 • 3 • 17 • 27
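
Nothing in the AdaGrad update ties the accumulator to single-sample gradients, so one sketch of "AdaGrad for batch descent" is simply to feed it the full-batch gradient; `loss_grad`, `X`, and `y` below are hypothetical placeholders, not from the question.

```python
import numpy as np

def batch_adagrad(theta, loss_grad, X, y, eta=0.1, eps=1e-8, n_iter=100):
    """AdaGrad driven by full-batch gradients: loss_grad(theta, X, y) is
    assumed to return the gradient computed over the entire dataset."""
    accum = np.zeros_like(theta)
    for _ in range(n_iter):
        g = loss_grad(theta, X, y)           # one gradient per full pass over the data
        accum += g ** 2
        theta = theta - eta / (np.sqrt(accum) + eps) * g
    return theta
```
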
4
votes
1 answer

Should you use optimization algorithms like Adagrad and ADAM for neural network online training?

Optimization algorithms like Adagrad and ADAM decay your learning rate over time. To me this sounds like a bad idea for online training since you're always getting new data as opposed to retraining on the same data for multiple epochs in…
user3768533 • 141 • 1
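
For intuition about the decay the asker worries about: if each new sample keeps producing gradients of similar magnitude, AdaGrad's accumulated sum keeps growing and the effective step size shrinks roughly like 1/sqrt(t), which is the concern for a never-ending online stream. A purely illustrative sketch:

```python
import numpy as np

eta, eps = 0.1, 1e-8
accum = 0.0
for t in range(1, 6):
    g = 1.0                                  # pretend |gradient| stays around 1
    accum += g ** 2
    print(t, eta / (np.sqrt(accum) + eps))   # 0.100, 0.071, 0.058, 0.050, 0.045
```
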
2
votes
1 answer

How to perform adagrad stochastic gradient descent (SGD) on word2vec?

AdaGrad is an enhanced SGD that automatically determines a per-parameter learning rate. However, in word2vec, there's no clear "parameter" to perform adagrad on. So what's the closest algorithm to adagrad for word2vec?
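
One reading of the question: word2vec's parameters are the entries of its input and output embedding matrices, so a per-parameter accumulator can live alongside each matrix. A minimal sketch under that reading (`W_in`, `G_in`, and the shapes are illustrative):

```python
import numpy as np

vocab_size, dim = 10000, 100
W_in = np.random.randn(vocab_size, dim) * 0.01   # input (word) embedding matrix
G_in = np.zeros_like(W_in)                       # squared-gradient accumulator, same shape
eta, eps = 0.05, 1e-8

def adagrad_update_row(word_idx, grad_row):
    """Update one word vector using only its own accumulated statistics."""
    G_in[word_idx] += grad_row ** 2
    W_in[word_idx] -= eta / (np.sqrt(G_in[word_idx]) + eps) * grad_row
```
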
1
vote
0 answers

Momentum vs adaptive step methods

My understanding is that: with momentum, one can avoid e.g. "zig-zags" during gradient descent by averaging gradients to determine a better direction of descent. With adaptive step size methods like AdaGrad and RMSProp, one accumulates gradients and…
Josh • 3,408 • 4 • 22 • 46
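
A side-by-side sketch of the two update rules being contrasted (both functions are illustrative, stateless helpers):

```python
import numpy as np

def momentum_step(theta, grad, velocity, eta=0.01, beta=0.9):
    """Momentum: blend recent gradients into one direction, smoothing zig-zags."""
    velocity = beta * velocity + grad
    return theta - eta * velocity, velocity

def adagrad_step(theta, grad, accum, eta=0.01, eps=1e-8):
    """Adaptive step size: accumulate squared gradients and rescale each coordinate."""
    accum = accum + grad ** 2
    return theta - eta / (np.sqrt(accum) + eps) * grad, accum
```
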
1
vote
0 answers

Momentum vs Polyak averaging

I'm going through this deck but don't quite get the difference between momentum and Polyak averaging, and what role Polyak averaging plays in modern optimizers. For example, is it correct to say that in momentum one averages parameter gradients…
Josh • 3,408 • 4 • 22 • 46
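
A hedged sketch of the distinction: momentum folds an average of gradients into each update, while Polyak averaging leaves the optimizer alone and averages the parameter iterates it produces, reporting that average for evaluation.

```python
import numpy as np

def polyak_average(theta_avg, theta, t):
    """Running mean of the iterates theta_1, ..., theta_t produced by any optimizer;
    the averaged parameters are used for evaluation, not for the next update."""
    return theta_avg + (theta - theta_avg) / t

# Toy usage: average the iterates of plain gradient descent on f(x) = x^2.
theta, theta_avg = np.array([5.0]), np.array([5.0])
for t in range(1, 101):
    grad = 2 * theta                          # gradient of x^2
    theta = theta - 0.05 * grad               # optimizer step, unchanged by the averaging
    theta_avg = polyak_average(theta_avg, theta, t)
```
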
1
vote
1 answer

Same exact model: converges with adagrad, diverges with adadelta

I have a very simple model built using Keras. What strikes me as surprising is that the very same training config converges (i.e. training loss goes down with every epoch) when the model uses the adagrad optimizer, but it diverges when I use…
Felipe • 990 • 2 • 10 • 18
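
A minimal harness for the kind of comparison described, using tf.keras; the architecture, data, and learning rates below are placeholders, not the asker's configuration.

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Placeholder architecture standing in for the "very simple model".
    return tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

X, y = np.random.rand(256, 10), np.random.rand(256)
for make_opt in (lambda: tf.keras.optimizers.Adagrad(learning_rate=0.01),
                 lambda: tf.keras.optimizers.Adadelta(learning_rate=1.0)):
    model = build_model()                     # identical architecture, only the optimizer differs
    model.compile(optimizer=make_opt(), loss="mse")
    model.fit(X, y, epochs=5, verbose=0)
    print(model.evaluate(X, y, verbose=0))    # compare final training loss per optimizer
```
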
1
vote
0 answers

Behavior of AdaGrad without the square root in the denominator

Multiple articles claim that AdaGrad does not work well when the square root in the formula is omitted. This is one such example: $\theta_{t+1,i} = \theta_{t,i}-\dfrac{\eta}{\sqrt{G_{t,ii}+\epsilon}}\times g_{t,i}$. Here $G_{t,ii}$ represents the…
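
A small numeric sketch of the difference those articles point at: without the square root the denominator grows like the raw sum of squared gradients, so the step size collapses much faster (values purely illustrative).

```python
import numpy as np

eta, eps = 0.1, 1e-8
G = 0.0
for t in range(1, 6):
    g = 1.0                                   # constant-magnitude gradient for illustration
    G += g ** 2
    with_sqrt = eta / np.sqrt(G + eps) * g    # decays roughly like 1/sqrt(t)
    without_sqrt = eta / (G + eps) * g        # decays roughly like 1/t
    print(t, round(with_sqrt, 4), round(without_sqrt, 4))
```
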
1
vote
1 answer

Intuition behind learning rate scheduling in AdaDelta

To address the problems with AdaGrad, the learning rate is changed from $\frac{\eta}{\sqrt{G_{t, ii}+\epsilon}}$ to one that includes only the gradients within a small window of size $w$. But as the function approaches the optimal value, won't the denominator become…
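
A sketch of that windowed accumulation: AdaDelta/RMSProp replace AdaGrad's ever-growing sum with an exponentially decaying average, where the decay rate `rho` plays the role of the window size $w$, and the epsilon inside the square root keeps the denominator strictly positive even when recent gradients are tiny.

```python
import numpy as np

def rms_step(theta, grad, E_g2, eta=0.001, rho=0.9, eps=1e-8):
    """RMSProp-style update: E_g2 is a decaying average of squared gradients
    (an effective window of roughly 1/(1 - rho) recent steps) instead of the
    full AdaGrad sum; eps keeps the denominator away from zero."""
    E_g2 = rho * E_g2 + (1.0 - rho) * grad ** 2
    theta = theta - eta / np.sqrt(E_g2 + eps) * grad
    return theta, E_g2
```
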
1
vote
0 answers

Why does Adagrad improve the robustness of SGD?

I mainly read this blog, and this blog cites this paper for the statement that Adagrad improved the robustness of SGD. I have tried to check the original paper and other articles that explain why Adagrad improves the robustness compared with…
1
vote
1 answer

Adagrad Expression about Element-wise matrix vector multiplication

Sometimes, Adagrad is expressed like this: $\mathbf{x}^{t+1} = \mathbf{x}^t - \dfrac{\eta}{\sqrt{G^t + \epsilon}} \odot \nabla E$, where $G^t$ is a diagonal matrix. According to the wiki, the Hadamard product is only defined when the two matrices have the same shape. However, some libraries make those…
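
On the shape question: the diagonal-matrix form and the element-wise (Hadamard) form coincide, because multiplying by diag(v) is the same as multiplying element-wise by the vector v of its diagonal entries, which is why libraries can store G as a vector. A quick NumPy check (numbers made up):

```python
import numpy as np

v = np.array([1.0, 4.0, 9.0])            # diagonal entries of G^t (accumulated squared grads)
grad = np.array([0.5, -1.0, 2.0])        # stands in for the gradient of E
eta, eps = 0.1, 1e-8

via_matrix = np.diag(eta / np.sqrt(v + eps)) @ grad   # diagonal-matrix formulation
via_vector = eta / np.sqrt(v + eps) * grad            # element-wise formulation
assert np.allclose(via_matrix, via_vector)
```
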
1
vote
0 answers

Treating Categorical Variables as Continuous for Random Forest / Adaboost

What's the correct way to deal with categorical variables in packages like sklearn's RF and xgboost? Are there any cons to treating the variables as continuous? E.g. encode class A as 1, class B as 2, class C as 3?
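
A sketch of the two encodings being contrasted, using pandas (the column and class names are made up):

```python
import pandas as pd

df = pd.DataFrame({"color": ["A", "B", "C", "A"]})

# "Treat as continuous": integer codes impose an artificial order A < B < C.
ordinal = df["color"].map({"A": 1, "B": 2, "C": 3})

# One-hot: one indicator column per class, no implied ordering.
one_hot = pd.get_dummies(df["color"], prefix="color")
```
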
1
vote
0 answers

What happens if we use training data in reverse chronological order?

My SGD-Adagrad algorithm trains on chronologically ordered data to make future predictions. The test and validation data occurred after the training data. What happens if I use the training data in reverse chronological order? What I think should happen: Since…
Swapniel • 145 • 6
0
votes
1 answer

When do Adaptive Optimization Algorithms modify their parameters?

When do "Ada" optimizers (e.g. Adagrad, Adam, etc...) "adapt" their parameters? Is it at the end of each mini-batch or epoch?
0
votes
1 answer

xgboost: get error for each iteration

In XGBoost, is there a way to programmatically get the training and evaluation error per iteration of training? Will train until eval error hasn't decreased in 25 rounds. [0] train-rmspe:0.996873 eval-rmspe:0.996881 [1] train-rmspe:0.981762 …
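
A sketch of one way to capture those per-iteration numbers with the xgboost learning API: passing a dict through `evals_result` records, per round, the same metrics that are otherwise only printed (synthetic data; `rmse` stands in for the rmspe-style metric in the excerpt).

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(200, 5), np.random.rand(200)
dtrain = xgb.DMatrix(X[:150], label=y[:150])
deval = xgb.DMatrix(X[150:], label=y[150:])

results = {}                                  # filled in-place by xgb.train
booster = xgb.train(
    params={"objective": "reg:squarederror", "eval_metric": "rmse"},
    dtrain=dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train"), (deval, "eval")],
    early_stopping_rounds=25,
    evals_result=results,
    verbose_eval=False,
)
print(results["eval"]["rmse"])                # one entry per boosting round
```
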
0
votes
1 answer

Divergence in Stochastic Gradient Descent

I am using Stochastic Gradient Descent with ADAGRAD. I am training on a training set of 1.6 billion examples. After about 30 million examples, the training loss starts increasing after reaching a low. The examples in the training set are ordered…
Swapniel • 145 • 6