
I imagine that setting a learning rate per layer (in the gradient descent update rule) could manage the vanishing gradient problem better than using a single global learning rate. Are there any specific techniques that do this? I've had a look on Google Scholar but couldn't find any techniques designed solely for this purpose.

Olivier_s_j

1 Answer


RMSProp and Adam both adapt learning rates on a per-parameter basis. They attempt to improve upon Adagrad, which decreases its learning rates monotonically toward 0.
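
To make Adagrad's behaviour concrete, here is a minimal NumPy sketch of a single update step (the function and variable names are mine, not taken from any library). Because the squared-gradient accumulator only ever grows, the effective per-parameter learning rate only ever shrinks:

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.01, eps=1e-8):
    # Accumulate the sum of all past squared gradients; it never shrinks,
    # so the effective per-parameter step size only decreases over time.
    accum = accum + grad ** 2
    param = param - lr * grad / (np.sqrt(accum) + eps)
    return param, accum
```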

RMSProp divides the learning rate by an exponentially-decaying average of squared gradients.
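
As a rough sketch of that idea (again with illustrative names and typical default hyperparameters, not tied to any particular framework):

```python
import numpy as np

def rmsprop_step(param, grad, sq_avg, lr=0.001, decay=0.9, eps=1e-8):
    # Exponentially-decaying average of squared gradients: old history is
    # gradually forgotten, so the effective step size can recover instead
    # of decaying to zero as in Adagrad.
    sq_avg = decay * sq_avg + (1 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(sq_avg) + eps)
    return param, sq_avg
```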

Adam tracks the first and second moments of gradient updates to improve upon RMSProp. But Adam isn't a cure-all, either; there is some evidence that Adam sometimes does worse than generic SGD with momentum in terms of generalization. (See: No change in accuracy using Adam Optimizer when SGD works fine)
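
For comparison, a minimal sketch of the standard Adam update with bias correction, using the commonly cited default hyperparameters:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: exponential average of gradients (first moment)
    # v: exponential average of squared gradients (second moment)
    # t: 1-based step counter, used for bias correction
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```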

An interesting recent overview of how to use Adam to achieve fast, good results can be found here: http://www.fast.ai/2018/07/02/adam-weight-decay/

The authors find that changing how Adam handles weight decay, combined with a few other tricks, can dramatically speed up network training.
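
The headline modification there is decoupled weight decay (often called AdamW): the decay is applied directly to the weights after the adaptive step, rather than being folded into the gradient as an L2 penalty. A hedged sketch, under the same assumptions as the snippets above:

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    # Same adaptive step as plain Adam, computed on the raw gradient...
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    # ...followed by weight decay applied directly to the weights,
    # instead of being added to the gradient as an L2 term.
    param = param - lr * wd * param
    return param, m, v
```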

Sycorax