Questions tagged [adam]

An adaptive algorithm for gradient-based optimization of stochastic objective functions, often used to train deep neural networks.

The Adam optimizer was first proposed in "Adam: A Method for Stochastic Optimization" by Diederik P. Kingma and Jimmy Lei Ba.

61 questions
58 votes · 6 answers

Adam optimizer with exponential decay

In most TensorFlow code I have seen, the Adam optimizer is used with a constant learning rate of 1e-4 (i.e. 0.0001). The code usually looks like the following: ...build the model... # Add the optimizer train_op =…
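For illustration (not from the question, and assuming the TensorFlow 2 Keras API rather than the TF1-style train_op in the excerpt), one way to pair Adam with an exponentially decaying learning rate; the decay values below are made up:

```python
import tensorflow as tf

# Exponentially decaying schedule in place of the constant 1e-4
# (decay_steps and decay_rate are illustrative, not recommendations).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=10_000,
    decay_rate=0.96,
    staircase=False,
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
# model.compile(optimizer=optimizer, loss=...) as usual.
```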
49 votes · 1 answer

How does the Adam method of stochastic gradient descent work?

I'm familiar with basic gradient descent algorithms for training neural networks. I've read the paper proposing Adam, "Adam: A Method for Stochastic Optimization". While I've definitely got some insights (at least), the paper seems to be too high…
daniel451 · 2,635
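For reference, a minimal sketch of one Adam update in plain NumPy, following the notation and default hyperparameters of the Kingma & Ba paper (an illustration, not the paper's reference implementation):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # biased second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias-corrected moments
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```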
32 votes · 2 answers

What is the reason that the Adam optimizer is considered robust to the value of its hyperparameters?

I was reading about the Adam optimizer for deep learning and came across the following sentence in the new book Deep Learning by Bengio, Goodfellow and Courville: Adam is generally regarded as being fairly robust to the choice of hyperparameters,…
27 votes · 2 answers

Explanation of Spikes in training loss vs. iterations with Adam Optimizer

I am training a neural network using i) SGD and ii) Adam Optimizer. When using normal SGD, I get a smooth training loss vs. iteration curve as seen below (the red one). However, when I used the Adam Optimizer, the training loss curve has some…
Abdul Fatir · 373
25 votes · 2 answers

Why is it important to include a bias correction term for the Adam optimizer for Deep Learning?

I was reading about the Adam optimizer for deep learning and came across the following sentence in the new book Deep Learning by Bengio, Goodfellow and Courville: Adam includes bias corrections to the estimates of both the first-order moments…
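For context, a sketch of the argument behind the correction (my summary of Kingma & Ba, not part of the excerpt): with $m_0 = 0$ the exponential moving average starts biased toward zero, and under a roughly stationary gradient distribution the factor it picks up is exactly $1-\beta_1^t$:

```latex
\[
  m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{\,t-i} g_i
  \;\Longrightarrow\;
  \mathbb{E}[m_t] \approx \bigl(1-\beta_1^{t}\bigr)\,\mathbb{E}[g_t],
  \qquad\text{so}\qquad
  \hat{m}_t = \frac{m_t}{1-\beta_1^{t}},\quad
  \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}.
\]
```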
19 votes · 1 answer

RMSProp and Adam vs SGD

I am performing experiments on the EMNIST validation set using networks with RMSProp, Adam and SGD. I am achieving 87% accuracy with SGD (learning rate of 0.1) and dropout (0.1 dropout prob) as well as L2 regularisation (1e-05 penalty). When testing…
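A rough Keras sketch of the SGD configuration the question reports (learning rate 0.1, dropout 0.1, L2 penalty 1e-05); the layer sizes and the 47-class output are assumptions, not details from the question:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-5)),  # L2 penalty from the question
    layers.Dropout(0.1),                                     # dropout prob from the question
    layers.Dense(47, activation="softmax"),                  # EMNIST Balanced has 47 classes (assumption)
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Swapping in keras.optimizers.RMSprop() or keras.optimizers.Adam()
# reproduces the comparison being described.
```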
16 votes · 2 answers

The reason for the superiority of limited-memory BFGS over the Adam solver

I am using the multilayer perceptron MLPClassifier for training a classification model for my problem. I noticed that using the solver lbfgs (I guess this means limited-memory BFGS in scikit-learn) outperforms Adam when the dataset is relatively small…
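A self-contained sketch of that comparison with scikit-learn's MLPClassifier on a small bundled dataset (load_digits is a stand-in for the questioner's data); in scikit-learn, solver='lbfgs' does indeed mean limited-memory BFGS:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)            # small dataset, stand-in only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for solver in ("lbfgs", "adam"):
    clf = MLPClassifier(solver=solver, max_iter=1000, random_state=0)
    clf.fit(X_tr, y_tr)
    print(solver, clf.score(X_te, y_te))
```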
13 votes · 1 answer

How to choose between SGD with Nesterov momentum and Adam?

I'm currently implementing a neural network architecture in Keras. I would like to optimize the training time, and I'm considering using alternative optimizers such as SGD with Nesterov momentum and Adam. I've read several things about the pros and…
Clément F · 1,717
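The two optimizer choices being weighed, written out for Keras (the learning rates are common defaults, not values from the question):

```python
from tensorflow import keras

sgd_nesterov = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
adam = keras.optimizers.Adam(learning_rate=1e-3)

# model.compile(optimizer=sgd_nesterov, loss="categorical_crossentropy")
# vs.
# model.compile(optimizer=adam, loss="categorical_crossentropy")
```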
10 votes · 0 answers

What is the mistake in the convergence proof of Adam?

Sashank J. Reddi et al., in their paper "On the Convergence of Adam and Beyond", say that Adam's proof of convergence as stated in the original paper is wrong. More than that, they point out that the value $\Gamma_{t + 1} = \frac{\sqrt{V_{t+1}}}{a_{t+1}} -…
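For context (my hedged summary, not part of the excerpt): the original proof needs that quantity to stay positive semidefinite, i.e. the per-coordinate effective step size must never increase, which Adam can violate; the AMSGrad variant proposed in the same paper enforces this by keeping a running maximum of the second-moment estimate:

```latex
\[
  \hat{v}_t = \max\!\bigl(\hat{v}_{t-1},\, v_t\bigr),
  \qquad
  \theta_{t+1} = \theta_t - \frac{\alpha_t\, m_t}{\sqrt{\hat{v}_t}}
  \quad\text{(projection step omitted).}
\]
```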
9 votes · 2 answers

Training a neural network on chess data

I have been writing a chess engine with a friend and the engine itself is really good already (2700+ CCRL). We had the idea to use a neural network to have a better evaluation of positions. Input to the network: because the output of the network…
9 votes · 2 answers

How does batch size affect the Adam optimizer?

What impact does mini-batch size have on the Adam optimizer? Is there a recommended mini-batch size when training a (convolutional) neural network with the Adam optimizer? From what I understood (I might be wrong), for small mini-batch sizes the results tend…
Hello Lili · 319
9 votes · 1 answer

Is manually tuning learning rate during training redundant with optimization methods like Adam?

I have seen some high-profile deep learning papers where an optimization method like Adam was used, yet the learning rate was manually changed at specific iterations. What is the relationship between the adaptivity provided by adaptive optimization…
Sami Liedes · 445
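A sketch of the pattern such papers use, in Keras terms: Adam provides the per-parameter adaptivity, while a callback manually steps the base learning rate at chosen epochs (the milestones and factor below are hypothetical):

```python
from tensorflow import keras

def step_schedule(epoch, lr):
    # halve the base learning rate at epochs 30 and 60 (hypothetical milestones)
    return lr * 0.5 if epoch in (30, 60) else lr

optimizer = keras.optimizers.Adam(learning_rate=1e-3)
lr_callback = keras.callbacks.LearningRateScheduler(step_schedule)
# model.compile(optimizer=optimizer, ...)
# model.fit(x, y, epochs=90, callbacks=[lr_callback])
```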
8 votes · 2 answers

How well should I expect Adam to work?

I've been coding up a neural network package for my own amusement, and it seems to work. I've been reading about Adam and from what I've seen it's very difficult to beat. Well, when I implement the Adam algorithm in my code it does terribly -…
8 votes · 1 answer

What does diagonal rescaling of the gradients mean in the Adam paper?

I was reading the original paper on ADAM (Adam: A Method for Stochastic Optimization), which mentions: [...] invariant to diagonal rescaling of the gradients, [...] What does it mean? Also, another paper - Normalized Direction-preserving Adam -…
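A short sketch of what the claim amounts to (my reading of the paper, ignoring $\epsilon$): if every gradient coordinate $g^{(i)}$ is multiplied by a fixed constant $c_i > 0$, then $\hat{m}_t^{(i)}$ picks up a factor $c_i$ and $\hat{v}_t^{(i)}$ a factor $c_i^2$, so the update is unchanged:

```latex
\[
  \frac{c_i\,\hat{m}_t^{(i)}}{\sqrt{c_i^{2}\,\hat{v}_t^{(i)}}}
  \;=\;
  \frac{\hat{m}_t^{(i)}}{\sqrt{\hat{v}_t^{(i)}}}.
\]
```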
7 votes · 1 answer

What does decay_steps mean in Tensorflow tf.train.exponential_decay?

I am trying to implement an exponential learning rate decay with the Adam optimizer for an LSTM. I do not want the staircase=True version. To me, decay_steps feels like the number of steps for which the learning rate is kept constant, but I am not…
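What tf.train.exponential_decay computes, rewritten in plain Python to show the role of decay_steps (my paraphrase of the documented formula): with staircase=False the decay is continuous, so the rate is not held constant between steps; only the staircase version keeps it flat for decay_steps steps at a time.

```python
def exponential_decay(initial_lr, global_step, decay_steps, decay_rate,
                      staircase=False):
    # decayed_lr = initial_lr * decay_rate ** (global_step / decay_steps)
    exponent = global_step / decay_steps
    if staircase:
        exponent = global_step // decay_steps  # integer division -> piecewise-constant rate
    return initial_lr * decay_rate ** exponent
```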