
In the book "Deep Learning" by Goodfellow, Bengio, and Courville, I do not understand the following statement about why nonlinearities in deep neural nets give rise to very high derivatives:

The objective function for highly nonlinear deep neural networks or for recurrent neural networks often contains sharp nonlinearities in parameter space resulting from the multiplication of several parameters. These nonlinearities give rise to very high derivatives in some places. When the parameters get close to such a cliff region, a gradient descent update can catapult the parameters very far, possibly losing most of the optimization work that has been done.


1 Answer


As mentioned in Digio's comment, this is a description of the exploding gradient problem. A simple single-variable example shows the mechanism. A gradient descent update is proportional to the derivative of the objective: $x_{n+1} = x_{n} - \eta f'(x_{n})$, where $\eta$ is the step size. (If you know Newton's method from calculus, $x_{n+1} = x_{n} - \frac{f(x_{n})}{f'(x_{n})}$, the iteration has a similar form, but there the step is divided by the derivative rather than multiplied by it.) If the derivative at $x_{n}$ is huge, the step $\Delta x = x_{n+1} - x_{n}$ is huge as well - i.e., the update explodes and can throw the parameters far from the region that earlier steps had carefully reached. Here is an image of this same phenomenon occurring with a multivariate function:
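To see the catapult effect numerically, here is a minimal sketch (a toy example of my own, not taken from the book or the paper). The loss $L(w) = (w^{k}-1)^2$ mimics a depth-$k$ chain of identical scalar weights, so the repeated multiplication of the same parameter is exactly what makes the surface steep once $|w|$ moves past 1; the depth, starting point, and step size below are arbitrary illustrative choices.

```python
# Toy illustration of a gradient-descent step near a "cliff".
# Loss: L(w) = (w**k - 1)**2, which mimics a depth-k chain of identical
# scalar weights; the repeated multiplication of the same parameter w is
# what makes the surface extremely steep once |w| moves past 1.

K = 10           # "depth": how many times the parameter multiplies itself
LR = 0.01        # a step size that would be harmless on a flat region

def loss(w, k=K):
    return (w**k - 1.0) ** 2

def grad(w, k=K):
    # dL/dw = 2 * (w**k - 1) * k * w**(k-1)
    return 2.0 * (w**k - 1.0) * k * w ** (k - 1)

w = 1.5                      # start slightly past the cliff at |w| = 1
g = grad(w)
w_next = w - LR * g          # plain gradient-descent update

print(f"loss at w={w}: {loss(w):.3e}")
print(f"gradient at w={w}: {g:.3e}")          # ~4e+04, dominated by w**(k-1)
print(f"one update moves w to {w_next:.3e}")  # thrown to roughly -4e+02
```

A single update throws $w$ from 1.5 to roughly $-434$, undoing any progress made so far. Gradient clipping - rescaling the gradient whenever its norm exceeds a threshold - is the remedy proposed in the paper below; it would cap this step at a harmless size.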

Update trajectory demonstrating the exploding gradient problem

The source for the above image is "Understanding the exploding gradient problem" by Pascanu, Mikolov, and Bengio (2012): https://www.semanticscholar.org/paper/Understanding-the-exploding-gradient-problem-Pascanu-Mikolov/728d814b92a9d2c6118159bb7d9a4b3dc5eeaaeb

I highly recommend reading the article - it is quite informative and will provide you with a deeper understanding of this problem.
