Questions tagged [adam]

An adaptive algorithm for gradient-based optimization of stochastic objective functions, often used to train deep neural networks.

The Adam optimizer was first proposed in "Adam: A Method for Stochastic Optimization" by Diederik P. Kingma and Jimmy Lei Ba.

61 questions
58 votes · 6 answers

Adam optimizer with exponential decay

In most TensorFlow code I have seen, the Adam optimizer is used with a constant learning rate of 1e-4 (i.e. 0.0001). The code usually looks like the following: ...build the model... # Add the optimizer train_op =…
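For illustration (not from the question, and assuming the TensorFlow 2 Keras API rather than the TF1-style train_op in the excerpt), one way to pair Adam with an exponentially decaying learning rate; the decay values below are made up:

```python
import tensorflow as tf

# Exponentially decaying schedule in place of the constant 1e-4
# (decay_steps and decay_rate are illustrative, not recommendations).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=10_000,
    decay_rate=0.96,
    staircase=False,
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
# model.compile(optimizer=optimizer, loss=...) as usual.
```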
49 votes · 1 answer

How does the Adam method of stochastic gradient descent work?

I'm familiar with basic gradient descent algorithms for training neural networks. I've read the paper proposing Adam, "Adam: A Method for Stochastic Optimization". While I've definitely got some insights (at least), the paper seems to be too high…
daniel451 · 2,635
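For reference, a minimal sketch of one Adam update in plain NumPy, following the notation and default hyperparameters of the Kingma & Ba paper (an illustration, not the paper's reference implementation):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # biased second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias-corrected moments
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```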
32 votes · 2 answers

What is the reason that the Adam optimizer is considered robust to the value of its hyperparameters?

I was reading about the Adam optimizer for deep learning and came across the following sentence in the new book Deep Learning by Bengio, Goodfellow and Courville: Adam is generally regarded as being fairly robust to the choice of hyperparameters,…
27 votes · 2 answers

Explanation of Spikes in training loss vs. iterations with Adam Optimizer

I am training a neural network using i) SGD and ii) Adam Optimizer. When using normal SGD, I get a smooth training loss vs. iteration curve as seen below (the red one). However, when I used the Adam Optimizer, the training loss curve has some…
Abdul Fatir · 373
25 votes · 2 answers

Why is it important to include a bias correction term for the Adam optimizer for Deep Learning?

I was reading about the Adam optimizer for deep learning and came across the following sentence in the new book Deep Learning by Bengio, Goodfellow and Courville: Adam includes bias corrections to the estimates of both the first-order moments…
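For context, a sketch of the argument behind the correction (my summary of Kingma & Ba, not part of the excerpt): with $m_0 = 0$ the exponential moving average starts biased toward zero, and under a roughly stationary gradient distribution the factor it picks up is exactly $1-\beta_1^t$:

```latex
\[
  m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{\,t-i} g_i
  \;\Longrightarrow\;
  \mathbb{E}[m_t] \approx \bigl(1-\beta_1^{t}\bigr)\,\mathbb{E}[g_t],
  \qquad\text{so}\qquad
  \hat{m}_t = \frac{m_t}{1-\beta_1^{t}},\quad
  \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}.
\]
```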
19 votes · 1 answer

RMSProp and Adam vs SGD

I am performing experiments on the EMNIST validation set using networks with RMSProp, Adam and SGD. I am achieving 87% accuracy with SGD (learning rate of 0.1) and dropout (0.1 dropout prob) as well as L2 regularisation (1e-05 penalty). When testing…
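A rough Keras sketch of the SGD configuration the question reports (learning rate 0.1, dropout 0.1, L2 penalty 1e-05); the layer sizes and the 47-class output are assumptions, not details from the question:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-5)),  # L2 penalty from the question
    layers.Dropout(0.1),                                     # dropout prob from the question
    layers.Dense(47, activation="softmax"),                  # EMNIST Balanced has 47 classes (assumption)
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Swapping in keras.optimizers.RMSprop() or keras.optimizers.Adam()
# reproduces the comparison being described.
```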
16 votes · 2 answers

The reason for the superiority of limited-memory BFGS over the Adam solver

I am using the multilayer perceptron MLPClassifier for training a classification model for my problem. I noticed that using the solver lbfgs (I guess this means limited-memory BFGS in scikit-learn) outperforms Adam when the dataset is relatively small…
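A self-contained sketch of that comparison with scikit-learn's MLPClassifier on a small bundled dataset (load_digits is a stand-in for the questioner's data); in scikit-learn, solver='lbfgs' does indeed mean limited-memory BFGS:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)            # small dataset, stand-in only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for solver in ("lbfgs", "adam"):
    clf = MLPClassifier(solver=solver, max_iter=1000, random_state=0)
    clf.fit(X_tr, y_tr)
    print(solver, clf.score(X_te, y_te))
```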
13 votes · 1 answer

How to choose between SGD with Nesterov momentum and Adam?

I'm currently implementing a neural network architecture in Keras. I would like to optimize the training time, and I'm considering using alternative optimizers such as SGD with Nesterov momentum and Adam. I've read several things about the pros and…
Clément F · 1,717
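The two optimizer choices being weighed, written out for Keras (the learning rates are common defaults, not values from the question):

```python
from tensorflow import keras

sgd_nesterov = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
adam = keras.optimizers.Adam(learning_rate=1e-3)

# model.compile(optimizer=sgd_nesterov, loss="categorical_crossentropy")
# vs.
# model.compile(optimizer=adam, loss="categorical_crossentropy")
```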
10 votes · 0 answers

What is the mistake in the convergence proof of Adam?

Sashank J. Reddi et al., in their paper "On the Convergence of Adam and Beyond", say that Adam's proof of convergence as stated in the original paper is wrong. More than that, they point out that the value $\Gamma_{t + 1} = \frac{\sqrt{V_{t+1}}}{a_{t+1}} -…
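For context (my hedged summary, not part of the excerpt): the original proof needs that quantity to stay positive semidefinite, i.e. the per-coordinate effective step size must never increase, which Adam can violate; the AMSGrad variant proposed in the same paper enforces this by keeping a running maximum of the second-moment estimate:

```latex
\[
  \hat{v}_t = \max\!\bigl(\hat{v}_{t-1},\, v_t\bigr),
  \qquad
  \theta_{t+1} = \theta_t - \frac{\alpha_t\, m_t}{\sqrt{\hat{v}_t}}
  \quad\text{(projection step omitted).}
\]
```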
9 votes · 2 answers

Training a neural network on chess data

I have been writing a chess engine with a friend and the engine itself is really good already (2700+ CCRL). We had the idea to use a neural network to have a better evaluation of positions. Input to the network: because the output of the network…
9 votes · 2 answers

How does batch size affect the Adam optimizer?

What impact does mini-batch size have on the Adam optimizer? Is there a recommended mini-batch size when training a (convolutional) neural network with the Adam optimizer? From what I understood (I might be wrong), for small mini-batch sizes the results tend…
Hello Lili · 319
9 votes · 1 answer

Is manually tuning learning rate during training redundant with optimization methods like Adam?

I have seen some high-profile deep learning papers where an optimization method like Adam was used, yet the learning rate was manually changed at specific iterations. What is the relationship between the adaptivity provided by adaptive optimization…
Sami Liedes · 445
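A sketch of the pattern such papers use, in Keras terms: Adam provides the per-parameter adaptivity, while a callback manually steps the base learning rate at chosen epochs (the milestones and factor below are hypothetical):

```python
from tensorflow import keras

def step_schedule(epoch, lr):
    # halve the base learning rate at epochs 30 and 60 (hypothetical milestones)
    return lr * 0.5 if epoch in (30, 60) else lr

optimizer = keras.optimizers.Adam(learning_rate=1e-3)
lr_callback = keras.callbacks.LearningRateScheduler(step_schedule)
# model.compile(optimizer=optimizer, ...)
# model.fit(x, y, epochs=90, callbacks=[lr_callback])
```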
8 votes · 2 answers

How well should I expect Adam to work?

I've been coding up a neural network package for my own amusement, and it seems to work. I've been reading about Adam and from what I've seen it's very difficult to beat. Well, when I implement the Adam algorithm in my code it does terribly -…
8 votes · 1 answer

What does diagonal rescaling of the gradients mean in the Adam paper?

I was reading the original paper on ADAM (Adam: A Method for Stochastic Optimization), which mentions: [...] invariant to diagonal rescaling of the gradients, [...] What does it mean? Also, another paper - Normalized Direction-preserving Adam -…
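A short sketch of what the claim amounts to (my reading of the paper, ignoring $\epsilon$): if every gradient coordinate $g^{(i)}$ is multiplied by a fixed constant $c_i > 0$, then $\hat{m}_t^{(i)}$ picks up a factor $c_i$ and $\hat{v}_t^{(i)}$ a factor $c_i^2$, so the update is unchanged:

```latex
\[
  \frac{c_i\,\hat{m}_t^{(i)}}{\sqrt{c_i^{2}\,\hat{v}_t^{(i)}}}
  \;=\;
  \frac{\hat{m}_t^{(i)}}{\sqrt{\hat{v}_t^{(i)}}}.
\]
```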
7 votes · 1 answer

What does decay_steps mean in Tensorflow tf.train.exponential_decay?

I am trying to implement an exponential learning rate decay with the Adam optimizer for an LSTM. I do not want the staircase=True version. To me, decay_steps feels like the number of steps for which the learning rate is kept constant, but I am not…
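What tf.train.exponential_decay computes, rewritten in plain Python to show the role of decay_steps (my paraphrase of the documented formula): with staircase=False the decay is continuous, so the rate is not held constant between steps; only the staircase version keeps it flat for decay_steps steps at a time.

```python
def exponential_decay(initial_lr, global_step, decay_steps, decay_rate,
                      staircase=False):
    # decayed_lr = initial_lr * decay_rate ** (global_step / decay_steps)
    exponent = global_step / decay_steps
    if staircase:
        exponent = global_step // decay_steps  # integer division -> piecewise-constant rate
    return initial_lr * decay_rate ** exponent
```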