(Disclaimer: I didn't ask Hinton, and I didn't find further explanations in the video, so a guess is the best I can do.)
To clarify, by "gradients" Hinton refers here to the partial derivatives of the cost function with respect to the different parameters (i.e. the weights and biases).
My guess is that Hinton had the vanishing/exploding gradient problem in mind when he said that.
See this answer for an informal explanation of the vanishing gradient problem (the exploding gradient problem is the opposite, but the underlying idea is quite similar).
In short, vanishing/exploding gradients mean that the magnitudes of the gradient's components become smaller/bigger (respectively) as you move to earlier layers, i.e. the partial derivatives of the cost function with respect to parameters in earlier layers have much smaller/bigger magnitudes.
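To make this concrete, here is a minimal sketch (assuming PyTorch; the depth, layer widths, and random data are arbitrary choices for illustration) that backpropagates once through a deep sigmoid network and prints the gradient norm of each layer's weight matrix:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A deep stack of small sigmoid layers (sizes chosen arbitrarily).
depth = 10
layers = []
for _ in range(depth):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
layers.append(nn.Linear(32, 1))
net = nn.Sequential(*layers)

x = torch.randn(64, 32)  # a random mini-batch of 64 examples
y = torch.randn(64, 1)   # random targets (just for illustration)
loss = nn.MSELoss()(net(x), y)
loss.backward()

# Print the gradient norm of each weight matrix, from the first
# layer (earliest) to the last.
for i, m in enumerate(net):
    if isinstance(m, nn.Linear):
        print(f"layer {i:2d}: ||grad W|| = {m.weight.grad.norm().item():.3e}")
```

With sigmoid activations and default initialization, the printed norms usually drop by several orders of magnitude between the last layer and the first; swapping `nn.Sigmoid` for `nn.ReLU` tends to make the decay much less severe.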
In the chapter *Why are deep neural networks hard to train?* (in the book *Neural Networks and Deep Learning*), Michael Nielsen explains:

> [...] the gradient in early layers is the product of terms from all the later layers. When there are many layers, that's an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it's highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem.
This "product of terms from all the later layers" that Nielsen mentions (he explained more about it earlier in the chapter) is mostly composed of derivatives of the activation function and weights (at least in simple feedforward networks).
Read the whole chapter (which isn't that long and, in my opinion, is very well written), and see this answer for a more rigorous explanation of this "product of terms from all the later layers".
With regard to your guess:
- I don't think SGD adds a significant amount of noise to the magnitudes of the gradient's components. In practice, when we train a neural network using SGD, we take each step based on a mini-batch of training examples (rather than a single example), so the mini-batch estimate of the gradient usually isn't significantly different from the actual gradient (see the first sketch after this list).
- As explained in this answer, the activation functions play an important role here. E.g. consider the sigmoid function: its derivative is at most $0.25$, so in earlier layers the "product of terms from all the later layers" would contain many sigmoid derivatives, which makes the gradient more likely to vanish in earlier layers (see the second sketch after this list).
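Regarding the first point, here is a quick sketch (plain NumPy, with a made-up linear least-squares problem; the data, dimensions, and batch size are arbitrary) comparing a mini-batch gradient to the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data for a linear least-squares problem: X w ~ y.
n, d = 10_000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = rng.normal(size=d)  # current parameters

def grad(Xb, yb, w):
    # Gradient of the mean squared error (1/m) * ||Xb w - yb||^2.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                        # the "actual" gradient
idx = rng.choice(n, size=64, replace=False) # a mini-batch of 64 examples
mini = grad(X[idx], y[idx], w)              # the mini-batch estimate

cos = full @ mini / (np.linalg.norm(full) * np.linalg.norm(mini))
print(f"cosine similarity (full vs. mini-batch): {cos:.3f}")
```

For reasonable batch sizes, the cosine similarity is typically close to $1$, i.e. the mini-batch estimate points in nearly the same direction as the actual gradient.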
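Regarding the second point, here is a quick numeric check of the $0.25$ bound, and of what a product of many such factors implies:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# sigma'(x) = sigma(x) * (1 - sigma(x)), which peaks at x = 0.
x = np.linspace(-10, 10, 10001)
dsig = sigmoid(x) * (1.0 - sigmoid(x))
print(f"max sigma'(x) on the grid: {dsig.max():.4f}")  # ~0.2500

# Even in the best case, 10 sigmoid factors scale the gradient
# by at most 0.25**10:
print(f"0.25**10 = {0.25 ** 10:.2e}")  # ~9.5e-07
```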