I imagine that setting a learning rate per layer (in the gradient descent update rule) could manage the vanishing gradient problem better than using a single global learning rate (see the sketch below for what I mean). Are there any specific techniques that deal with this? I've had a look on Google Scholar but couldn't find any techniques designed solely for this purpose.
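For concreteness, here is a minimal sketch of what I mean by a per-layer learning rate, using PyTorch parameter groups (the model architecture and the particular rates are arbitrary placeholders, not a recommendation):

```python
import torch
from torch import nn, optim

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# One parameter group per layer, each with its own learning rate.
optimizer = optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-2},  # earlier layer: larger rate
        {"params": model[2].parameters()},              # later layer: uses the default lr
    ],
    lr=1e-3,
    momentum=0.9,
)
```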
1 Answer
RMSProp and Adam both adapt learning rates on a per-parameter basis. They attempt to improve upon Adagrad, whose accumulated sum of squared gradients monotonically decreases the learning rates toward 0.
RMSProp divides the learning rate by an exponentially-decaying average of squared gradients.
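For illustration, a minimal sketch of that update rule (assuming a single NumPy parameter array `w`, its gradient `grad`, and a running `cache` of squared gradients; `rho` and `eps` are typical defaults, not values fixed by the algorithm):

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSProp step: divide the learning rate by the root of an
    exponentially-decaying average of squared gradients."""
    cache = rho * cache + (1 - rho) * grad ** 2     # running average of g^2
    w = w - lr * grad / (np.sqrt(cache) + eps)      # per-parameter effective step size
    return w, cache
```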
Adam tracks the first and second moments of the gradients to improve upon RMSProp. But Adam isn't a cure-all either; there is some evidence that Adam sometimes generalizes worse than plain SGD with momentum. (See: No change in accuracy using Adam Optimizer when SGD works fine)
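A minimal sketch of the Adam update under the same assumptions as above (NumPy arrays; `t` is the 1-based step count used for bias correction; the hyperparameter values are common defaults, not prescribed ones):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: track first (m) and second (v) moments of the gradient,
    with bias correction for the early steps (t is 1-based)."""
    m = beta1 * m + (1 - beta1) * grad              # first moment estimate (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2         # second moment estimate (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                    # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```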
An interesting recent overview of how to get fast, high-quality results with Adam can be found here: http://www.fast.ai/2018/07/02/adam-weight-decay/
The authors find that correcting how Adam handles weight decay (decoupled weight decay, often referred to as AdamW), along with a few other tricks, can dramatically speed up network training.
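For example, PyTorch exposes the decoupled-weight-decay variant as `torch.optim.AdamW`. A minimal usage sketch (the model and hyperparameter values are arbitrary placeholders):

```python
import torch
from torch import nn, optim

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Classic Adam folds L2 regularization into the gradient, while AdamW applies
# weight decay directly to the weights (decoupled weight decay).
adam = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```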
