Why do saddle points become "attractive" in Newtonian dynamics?

Question

I am reading Identifying and attacking the saddle point problem in high-dimensional non-convex optimization by Dauphin et. al. and the first paragraph on the second page states the following:

A typical problem for both local minima and saddle-points is that they are often surrounded by plateaus of small curvature in the error. While gradient descent dynamics are repelled away from a saddle point to lower error by following directions of negative curvature, this repulsion can occur slowly due to the plateau. Second order methods, like the Newton method, are designed to rapidly descend plateaus surrounding local minima by rescaling gradient steps by the inverse eigenvalues of the Hessian matrix. However, the Newton method does not treat saddle points appropriately; as argued below, saddle-points instead become attractive under the Newton dynamics.

I am unable to understand the last two sentences of this paragraph, more specifically:

I understand that second order methods like Newton method takes the curvature of a loss surface into account while finding the global minima. I am lost when the authors mention ...inverse eigenvalues of the Hessian matrix. Is there a more approachable way to understand this?
Why, in general, do saddle points become more attractive to Newtonian methods?

I would highly appreciate if there exist some intuitive explanations to the above questions.

Just consider the one dimensional case: the more curved the error curve the smaller you want your learning rate, conversely the less curved the larger the learning rate. Now imagine a bowl, the eigenvalues are the curvature along the principal axes — seanv507, Sep 19 '18 at 00:03

Sycorax · Accepted Answer · 2018-09-18T15:09:06.520

There's not really a more intuitive way to think about this. Suppose that you have the eigendecomposition of the Hessian for $P$ an orthonormal matrix of eigenvectors and $D$ diagonal matrix of eigenvalues. $$ \begin{align} \nabla^2 f(x) &= PDP^\top \\ \left[\nabla^2 f(x)\right]^{-1} &= PD^{-1}P^\top \end{align} $$ This is relevant to Newton's method because the update is given by $$ \begin{align} x^{(t+1)} &= x^{(t)}-\left[\nabla^2 f(x^{(t)})\right]^{-1}\nabla f(x^{(t)}) \\ &= x^{(t)}-PD^{-1}P^\top\nabla f(x^{(t)}) \end{align} $$
Saddle points have gradient 0, and Newton's method seeks points with gradient 0. If the problem is non-convex, then depending on the starting point, you may find yourself in the "basin of attraction" of the saddle point.

I also think that this post is of interest:

Why do saddle points become "attractive" in Newtonian dynamics?

1 Answers1

Related