(Disclaimer: I didn't ask Hinton, and I didn't find further explanations in the video, so a guess is the best I can do.)
To clarify, by "gradients" Hinton refers here to the partial derivatives of the cost function with respect to the different parameters (i.e. the weights and biases).
My guess is that Hinton had the vanishing/exploding gradient problem in mind when he said that.
See this answer for an informal explanation of the vanishing gradient problem (the exploding gradient problem is the opposite, but the underlying idea is quite similar).
In short, vanishing/exploding gradients mean that the magnitudes of the gradient's components become smaller/bigger (respectively) as you move to earlier layers, i.e. the partial derivatives of the cost function with respect to parameters in earlier layers have much smaller/bigger magnitudes.
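To make this concrete, here is a minimal sketch (assuming PyTorch; the depth, layer widths, and random data are arbitrary choices for illustration) that backpropagates once through a deep sigmoid network and prints the gradient norm of each layer's weight matrix:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A deep stack of small sigmoid layers (sizes chosen arbitrarily).
depth = 10
layers = []
for _ in range(depth):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
layers.append(nn.Linear(32, 1))
net = nn.Sequential(*layers)

x = torch.randn(64, 32)  # a random mini-batch of 64 examples
y = torch.randn(64, 1)   # random targets (just for illustration)
loss = nn.MSELoss()(net(x), y)
loss.backward()

# Print the gradient norm of each weight matrix, from the first
# layer (earliest) to the last.
for i, m in enumerate(net):
    if isinstance(m, nn.Linear):
        print(f"layer {i:2d}: ||grad W|| = {m.weight.grad.norm().item():.3e}")
```

With sigmoid activations and default initialization, the printed norms usually drop by several orders of magnitude between the last layer and the first; swapping `nn.Sigmoid` for `nn.ReLU` tends to make the decay much less severe.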
In the chapter *Why are deep neural networks hard to train?* (in the book *Neural Networks and Deep Learning*), Michael Nielsen explains:

> [...] the gradient in early layers is the product of terms from all the later layers. When there are many layers, that's an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it's highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem.
This "product of terms from all the later layers" that Nielsen mentions (he explained more about it earlier in the chapter) is mostly composed of derivatives of the activation function and weights (at least in simple feedforward networks).
Read the whole chapter (which isn't that long and, in my opinion, is very well written), and see this answer for a more rigorous explanation of this "product of terms from all the later layers".
With regard to your guess:
- I don't think SGD adds a significant amount of noise to the magnitudes of the gradient's components. In practice, when we train a neural network using SGD, we take each step based on a mini-batch of training examples (rather than a single example), so the mini-batch estimate of the gradient usually isn't significantly different from the actual gradient (see the first sketch after this list).
- As explained in this answer, the activation functions play an important role here. E.g. consider the sigmoid function: its derivative is at most $0.25$, so in earlier layers the "product of terms from all the later layers" would contain many sigmoid derivatives, which makes the gradient more likely to vanish in earlier layers (see the second sketch after this list).
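Regarding the first point, here is a quick sketch (plain NumPy, with a made-up linear least-squares problem; the data, dimensions, and batch size are arbitrary) comparing a mini-batch gradient to the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data for a linear least-squares problem: X w ~ y.
n, d = 10_000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = rng.normal(size=d)  # current parameters

def grad(Xb, yb, w):
    # Gradient of the mean squared error (1/m) * ||Xb w - yb||^2.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                        # the "actual" gradient
idx = rng.choice(n, size=64, replace=False) # a mini-batch of 64 examples
mini = grad(X[idx], y[idx], w)              # the mini-batch estimate

cos = full @ mini / (np.linalg.norm(full) * np.linalg.norm(mini))
print(f"cosine similarity (full vs. mini-batch): {cos:.3f}")
```

For reasonable batch sizes, the cosine similarity is typically close to $1$, i.e. the mini-batch estimate points in nearly the same direction as the actual gradient.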
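Regarding the second point, here is a quick numeric check of the $0.25$ bound, and of what a product of many such factors implies:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# sigma'(x) = sigma(x) * (1 - sigma(x)), which peaks at x = 0.
x = np.linspace(-10, 10, 10001)
dsig = sigmoid(x) * (1.0 - sigmoid(x))
print(f"max sigma'(x) on the grid: {dsig.max():.4f}")  # ~0.2500

# Even in the best case, 10 sigmoid factors scale the gradient
# by at most 0.25**10:
print(f"0.25**10 = {0.25 ** 10:.2e}")  # ~9.5e-07
```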