I imagine that setting a learning rate per layer (in the gradient descent update rule) could manage the vanishing gradient problem better than using a single global learning rate (see the sketch below for what I mean). Are there any specific techniques that deal with this? I've had a look on Google Scholar but couldn't find any techniques designed solely for this purpose.
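For concreteness, here is a minimal sketch of what I mean by a per-layer learning rate, using PyTorch parameter groups (the model architecture and the particular rates are arbitrary placeholders, not a recommendation):

```python
import torch
from torch import nn, optim

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# One parameter group per layer, each with its own learning rate.
optimizer = optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-2},  # earlier layer: larger rate
        {"params": model[2].parameters()},              # later layer: uses the default lr
    ],
    lr=1e-3,
    momentum=0.9,
)
```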
1 Answer
RMSProp and Adam both adapt learning rates on a per-parameter basis. They attempt to improve upon Adagrad, whose accumulated sum of squared gradients monotonically decreases the learning rates toward 0.
RMSProp divides the learning rate by an exponentially-decaying average of squared gradients.
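For illustration, a minimal sketch of that update rule (assuming a single NumPy parameter array `w`, its gradient `grad`, and a running `cache` of squared gradients; `rho` and `eps` are typical defaults, not values fixed by the algorithm):

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSProp step: divide the learning rate by the root of an
    exponentially-decaying average of squared gradients."""
    cache = rho * cache + (1 - rho) * grad ** 2     # running average of g^2
    w = w - lr * grad / (np.sqrt(cache) + eps)      # per-parameter effective step size
    return w, cache
```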
Adam tracks the first and second moments of the gradients to improve upon RMSProp. But Adam isn't a cure-all either; there is some evidence that Adam sometimes generalizes worse than plain SGD with momentum. (See: No change in accuracy using Adam Optimizer when SGD works fine)
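A minimal sketch of the Adam update under the same assumptions as above (NumPy arrays; `t` is the 1-based step count used for bias correction; the hyperparameter values are common defaults, not prescribed ones):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: track first (m) and second (v) moments of the gradient,
    with bias correction for the early steps (t is 1-based)."""
    m = beta1 * m + (1 - beta1) * grad              # first moment estimate (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2         # second moment estimate (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                    # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```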
An interesting recent overview of how to get fast, high-quality results with Adam can be found here: http://www.fast.ai/2018/07/02/adam-weight-decay/
The authors find that correcting how Adam handles weight decay (decoupled weight decay, often referred to as AdamW), along with a few other tricks, can dramatically speed up network training.
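For example, PyTorch exposes the decoupled-weight-decay variant as `torch.optim.AdamW`. A minimal usage sketch (the model and hyperparameter values are arbitrary placeholders):

```python
import torch
from torch import nn, optim

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Classic Adam folds L2 regularization into the gradient, while AdamW applies
# weight decay directly to the weights (decoupled weight decay).
adam = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```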
