
I have been reading a lot about convolutional neural networks and was wondering how they avoid the vanishing gradient problem. I know deep belief networks stack single-level auto-encoders or other pre-trained shallow networks and can thus avoid this problem, but I don't know how it is avoided in CNNs.

According to Wikipedia:

"despite the above-mentioned "vanishing gradient problem," the superior processing power of GPUs makes plain back-propagation feasible for deep feedforward neural networks with many layers."

I don't understand why GPU processing would remove this problem.

Blaszard
Aly
  • Did the Wikipedia article not justify why GPUs help to address the vanishing gradient problem? Is it because, even though the gradients are small, GPUs are so fast that we still manage to improve the parameters by doing lots of steps? – Charlie Parker Aug 24 '16 at 01:12
  • Exactly. The vanishing gradient problem is the reason why lower-layer weights are updated at a very small rate, and thus it takes forever to train the network. But since GPUs let you do more computations (i.e. more updates to the weights) in less time, the vanishing gradient problem is, to some extent, *vanished*. – exAres Nov 30 '16 at 08:23
  • @CharlieParker, could you elaborate on `GPU's are fast correlated with vanishing gradients`? I can understand the speed argument (large memory bandwidth to process many matrix multiplications), but could you explain what it has to do with the derivatives? The [vanishing gradient issue seems to have more to do with weight initialization](https://stats.stackexchange.com/q/390648/157252), doesn't it? – Anu Feb 05 '19 at 19:33
  • Because the vanishing gradient is not really a problem; it is a feature of the butterfly effect. A derivative of a derivative of a derivative of a derivative: all of those are multiplied, and multiplying by a fraction gives another, smaller fraction. GPUs are just faster to train on, but with a CPU you would achieve the same results as with a GPU if you trained it for a few months. – Nulik Apr 07 '20 at 17:24

1 Answer

The vanishing gradient problem requires us to use small learning rates with gradient descent, which then needs many small steps to converge. This is a problem if you have a slow computer that takes a long time for each step. If you have a fast GPU that can perform many more steps in a day, this is less of a problem.
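To make that concrete, here is a toy sketch (just a 1-D quadratic with arbitrary learning rates, not a neural network): the smaller the step size, the more updates you need to reach the same tolerance, so the wall-clock cost is dominated by how fast each update runs.

```python
# Toy sketch: plain gradient descent on f(w) = w^2, starting from w = 10.
# Smaller learning rates need proportionally more updates to reach the same
# tolerance, so what matters in practice is how fast each update runs.

def steps_to_converge(lr, w0=10.0, tol=1e-6, max_steps=1_000_000):
    w, steps = w0, 0
    while abs(w) > tol and steps < max_steps:
        w -= lr * 2.0 * w  # gradient of w^2 is 2w
        steps += 1
    return steps

for lr in (0.1, 0.01, 0.001):
    print(f"lr = {lr}: {steps_to_converge(lr)} steps")
```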

There are several ways to tackle the vanishing gradient problem. I would guess that the largest effect for CNNs came from switching from sigmoid nonlinear units to rectified linear units. If you consider a simple neural network whose error $E$ depends on weight $w_{ij}$ only through $y_j$, where

$$y_j = f\left( \sum_k w_{kj} x_k \right),$$

its gradient is

\begin{align} \frac{\partial E}{\partial w_{ij}} &= \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{ij}} \\ &= \frac{\partial E}{\partial y_j} \cdot f'\left(\sum_k w_{kj} x_k\right) x_i. \end{align}

If $f$ is the logistic sigmoid function, $f'$ will be close to zero for inputs of large magnitude, whether positive or negative. If $f$ is a rectified linear unit,

\begin{align} f(u) = \max\left(0, u\right), \end{align} the derivative is zero only for negative inputs and 1 for positive inputs. Another important contribution comes from properly initializing the weights. This paper looks like a good source for understanding these challenges in more detail (although I haven't read it yet):

http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
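To see the sigmoid-vs-ReLU point numerically, here is a rough sketch (my own toy numbers, not code from the paper): backprop picks up one factor of $f'$ per layer, and for the sigmoid that factor is at most 0.25, while for an active ReLU unit it is exactly 1.

```python
# Rough numerical sketch: f'(u) for the logistic sigmoid vs. the rectifier,
# and what repeatedly multiplying such factors does over many layers.
import numpy as np

def logistic_deriv(u):
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)               # at most 0.25, ~0 for large |u|

def relu_deriv(u):
    return np.where(u > 0, 1.0, 0.0)   # exactly 1 on the active side

u = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("logistic f'(u):", logistic_deriv(u))  # [~5e-5, 0.105, 0.25, 0.105, ~5e-5]
print("ReLU     f'(u):", relu_deriv(u))      # [0, 0, 0, 1, 1]

# Backprop multiplies one such factor per layer; even the sigmoid's best
# case of 0.25 per layer vanishes after a few dozen layers, while a chain
# of active ReLU units keeps a factor of 1.
print("0.25 ** 50 =", 0.25 ** 50)            # ~7.9e-31
```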

Lucas
  • I'm a little puzzled about the rectified linear units. Yes, for sigmoids etc. the gradient is often very small, but for rectified linear units it is often exactly zero. Isn't that worse? Thus, if the weights of a unit are unfortunate, they will never change. – Hans-Peter Störr Sep 15 '15 at 19:55
  • Thinking about this, leaky and/or noisy ReLUs might be in use for that reason. – sunside Aug 16 '16 at 21:26
  • Why is your first sentence true? I.e. "The vanishing gradient problem requires us to use small learning rates with gradient descent which then needs many small steps to converge." Why do we need small learning rates to deal with the vanishing gradient problem? If the gradients are already small due to vanishing gradients, I would have expected that making them smaller only makes things worse. – Charlie Parker Aug 24 '16 at 01:11
  • Good question, I should have explained that statement better. The vanishing gradient problem is not that all gradients are small (which we could easily fix by using large learning rates), but that the gradients vanish as you backpropagate through the network. I.e., the gradients are small in some layers but large in other layers. If you use large learning rates, the whole thing explodes (because some gradients are large), so you have to use a small learning rate. Using multiple learning rates is another approach to addressing the problem, at the cost of introducing more hyperparameters. – Lucas Aug 24 '16 at 15:16
  • I would argue that the learning rate is mostly tied to the *exploding* gradient problem. Scaling the gradient down with an exaggeratedly low learning rate does not prevent vanishing gradients at all; it just delays the effect, as learning slows down considerably. The effect itself is caused by the repeated application of nonlinearities and the multiplication of small values. Of course there is a trend towards smaller learning rates (due to computing power), but that has nothing to do with vanishing gradients, as it only controls how well the state space is explored (given stable conditions). – runDOSrun May 04 '17 at 09:10
  • This is not correct. A small LR is required because of ReLU, not because of vanishing gradients. You can argue that ReLU is needed because a saturating non-linearity takes exponentially longer to arrive at the same accuracy, while ReLU + small LR takes longer, but not exponentially longer. – Shital Shah May 11 '18 at 06:01
  • @runDOSrun I did not say that small learning rates prevent vanishing gradients. Using _large_ learning rates would counter the effect of vanishing gradients in later layers. But we can't just use large learning rates, because some parameters still have large gradients. So we use learning rates that are too small for the later layers, leading to slow convergence. – Lucas Jun 23 '18 at 15:47