I am reading *Identifying and attacking the saddle point problem in high-dimensional non-convex optimization* by Dauphin et al., and the first paragraph on the second page states the following:
> A typical problem for both local minima and saddle-points is that they are often surrounded by plateaus of small curvature in the error. While gradient descent dynamics are repelled away from a saddle point to lower error by following directions of negative curvature, this repulsion can occur slowly due to the plateau. Second order methods, like the Newton method, are designed to rapidly descend plateaus surrounding local minima by rescaling gradient steps by the inverse eigenvalues of the Hessian matrix. However, the Newton method does not treat saddle points appropriately; as argued below, saddle-points instead become attractive under the Newton dynamics.
I am unable to understand the last two sentences of this paragraph. More specifically:
- I understand that second-order methods like Newton's method take the curvature of the loss surface into account while searching for minima. However, I am lost when the authors mention *...inverse eigenvalues of the Hessian matrix*. Is there a more approachable way to understand this? (See my attempt at the algebra after this list.)
- Why, in general, do saddle points become attractive under Newton's method?
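
For reference, here is how far I got with the algebra on my own (the notation below is mine, and this reading may well be wrong). Writing the eigendecomposition of the Hessian as $H = Q \Lambda Q^\top$ with eigenvalues $\lambda_i$ and eigenvectors $q_i$, the Newton step appears to be

$$\Delta\theta = -H^{-1}\nabla f = -\sum_i \frac{1}{\lambda_i}\,\big(q_i^\top \nabla f\big)\, q_i,$$

so along each eigenvector the gradient component is rescaled by the inverse eigenvalue $1/\lambda_i$. My guess is that when $\lambda_i < 0$ (a negative-curvature direction at a saddle) this factor flips the sign of the step along $q_i$, but I am not sure whether that is the right way to read "rescaling gradient steps by the inverse eigenvalues."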
I would highly appreciate any intuitive explanations of the above questions. For context, I have also included below a small numerical experiment I tried.
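
This is a minimal sketch I put together while trying to build intuition; the quadratic $f(x, y) = x^2 - y^2$, the starting point, and the step size are all my own choices (not from the paper). It just compares one plain gradient step with one plain Newton step near the saddle at the origin:

```python
import numpy as np

# Toy saddle: f(x, y) = x^2 - y^2, with a saddle point at the origin.
# (This function, the starting point, and the step size are my own choices,
#  not taken from the paper; this is only an attempt to see the behaviour.)

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def hessian(p):
    # Constant Hessian for this quadratic: diag(2, -2),
    # eigenvalues +2 (positive curvature) and -2 (negative curvature).
    return np.array([[2.0, 0.0], [0.0, -2.0]])

p0 = np.array([1.0, 0.5])

# Plain gradient descent step: the y-component of the step points away
# from the saddle, so the iterate is repelled along the negative-curvature direction.
lr = 0.1
p_gd = p0 - lr * grad(p0)

# Plain Newton step: rescales the gradient component along each eigenvector
# by 1/eigenvalue, which flips the sign along the negative-curvature direction
# and, for this quadratic, lands exactly on the saddle.
p_newton = p0 - np.linalg.solve(hessian(p0), grad(p0))

print("gradient step:", p_gd)      # y grows: moving away from the saddle
print("newton step  :", p_newton)  # [0., 0.]: jumps straight to the saddle
```

When I run this, the gradient step increases the $y$-coordinate (moving away from the saddle), while the Newton step lands exactly at $(0, 0)$, which seems consistent with the claim that saddle points become attractive under the Newton dynamics, though I may be missing something.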