
While 2nd order methods have many advantages, they also have a serious weakness: e.g. the plain Newton step (also in quasi-Newton methods like L-BFGS) is attracted to the closest zero-gradient point, which in high dimension is usually a saddle. Other methods pretend that our very non-convex function is locally convex (e.g. Gauss-Newton, Levenberg-Marquardt, the Fisher information matrix in K-FAC, the gradient covariance matrix in TONGA - see the overview) - again attracting not only to local minima (how bad is that?).

There is a belief that the number of saddles is ~exp(dim) times larger than the number of minima. Actively repelling them (instead of being attracted to them) requires controlling the signs of the curvatures (the Hessian eigenvalues) - e.g. negating the step sign in negative-curvature directions, as in the sketch below.
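For intuition, here is a minimal numpy sketch of such a sign-controlled Newton-like step for a small dimension (or a low-dimensional subspace); the eps cutoff near zero curvature is my own assumption, not taken from any specific method:

```python
import numpy as np

def saddle_repelling_step(hessian, grad, eps=1e-4):
    """Replace Hessian eigenvalues by their absolute values (clipped away
    from zero), so the step is reversed along negative-curvature directions
    instead of being attracted to the saddle. A dense eigendecomposition is
    only feasible in small dimension or on a chosen subspace."""
    lam, V = np.linalg.eigh(hessian)       # curvatures and their directions
    lam = np.maximum(np.abs(lam), eps)     # |lambda|, kept away from zero
    return -V @ ((V.T @ grad) / lam)       # step = -V |Lambda|^(-1) V^T grad
```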

This is done e.g. in the saddle-free Newton method (SFN) ( https://arxiv.org/pdf/1406.2572 ) - 2014, 600+ citations, a recent GitHub implementation. They claim to get a few times(!) lower error this way, e.g. on MNIST, while other methods got stuck on plateaus with strong negative eigenvalues:

[figure from the SFN paper]

Here is another very interesting paper, https://arxiv.org/pdf/1902.02366 , investigating the evolution of the Hessian eigenvalues for 3.3M parameters (a ~20-terabyte Hessian!), for example showing that the rare negative-curvature directions allow for relatively huge improvements:

[figure from this paper]

So it looks great - it seems we should all be using SFN or other methods that actively repel saddles ... but that hasn't happened - why? What are the weaknesses?

What are other promising 2nd order approaches handling saddles?

How can we improve SFN-like methods? For example, what I mostly dislike is estimating the Hessian directly from noisy data, which is numerically very problematic. Instead, we are really interested in the linear behavior of the 1st derivative - we can estimate it with (online) linear regression of gradients, with decaying weights on old gradients (see the sketch below). Another issue is the restriction to a Krylov subspace imposed by the numerical method (Lanczos) - the subspace should rather be based on gradient statistics, like their PCA, which again can be done online to get locally, statistically relevant directions.
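To make the "linear regression of gradients" idea concrete, here is a minimal sketch for a single fixed direction (the class name, the value of beta and the zero initialization of the averages are illustrative choices, not taken from any particular paper):

```python
class OnlineCurvature1D:
    """Online linear regression of directional derivatives g_t against
    positions x_t along one fixed direction: the regression slope estimates
    the 2nd derivative in that direction. Only 4 exponential moving averages
    are maintained (of x, g, x*x, x*g); beta controls how quickly old,
    less relevant gradients are forgotten."""
    def __init__(self, beta=0.9, eps=1e-8):
        self.beta, self.eps = beta, eps
        self.mx = self.mg = self.mxx = self.mxg = 0.0

    def update(self, x, g):
        """x: position along the direction, g: directional derivative there."""
        b = self.beta
        self.mx  = b * self.mx  + (1 - b) * x
        self.mg  = b * self.mg  + (1 - b) * g
        self.mxx = b * self.mxx + (1 - b) * x * x
        self.mxg = b * self.mxg + (1 - b) * x * g
        var = self.mxx - self.mx * self.mx   # weighted variance of positions
        cov = self.mxg - self.mx * self.mg   # weighted covariance of x and g
        return cov / (var + self.eps)        # slope ~ directional 2nd derivative
```

The returned slope can then be used like a directional Hessian entry, e.g. to choose the step length or to reverse the step when the curvature is negative, with the near-zero-eigenvalue case still needing special care.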

Jarek Duda

1 Answer


My joint paper:

https://arxiv.org/abs/2006.01512

Here is the GitHub link for the Python code:

https://github.com/hphuongdhsp/Q-Newton-method

It gives a theoretical proof of the heuristic argument in the second paper you linked to in your question. We also provide a simple way to proceed in the case where the Hessian is not invertible (sketched below).
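A minimal sketch of that construction, following the description given in the comments below (the invertibility tolerance and the explicit eigendecomposition are my own simplifications; see the linked repository for the actual implementation):

```python
import numpy as np

def q_newton_like_step(hess, grad, tol=1e-12):
    """If the Hessian is not invertible, add delta * ||grad||^2 * Id with the
    smallest delta from {0, 1, ..., d} that makes it invertible, then take a
    saddle-free (absolute-eigenvalue) Newton-like step with that matrix."""
    d = grad.size
    g2 = grad @ grad                                  # ||grad||^2
    for delta in range(d + 1):
        lam, V = np.linalg.eigh(hess + delta * g2 * np.eye(d))
        if np.min(np.abs(lam)) > tol:                 # perturbed matrix is invertible
            return -V @ ((V.T @ grad) / np.abs(lam))
    raise RuntimeError("gradient is (numerically) zero: x is a critical point")
```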

Two issues that I think currently prevent its use at large scale:

Cost of implementation. I read that there are some methods to reduce the cost, but I need to look into the details.

No guarantee of convergence. Maybe for the loss functions of popular DNNs we can hope to prove convergence.

On the other hand, a first-order method with very good theoretical justification that works well at large scale is Backtracking GD. You can check the code here:

https://github.com/hank-nguyen/MBT-optimizer
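For reference, a minimal sketch of one gradient-descent step with Armijo-type backtracking (the constants and function signature are illustrative, not the actual MBT-optimizer code):

```python
def backtracking_gd_step(f, grad_f, x, lr0=1.0, alpha=0.5, beta=0.7, max_halvings=50):
    """Shrink the learning rate until Armijo's sufficient-decrease condition
    f(x - lr*g) <= f(x) - alpha*lr*||g||^2 holds, then take the step.
    x and the output of grad_f are numpy arrays (or anything supporting @)."""
    g = grad_f(x)
    fx = f(x)
    lr = lr0
    for _ in range(max_halvings):
        if f(x - lr * g) <= fx - alpha * lr * (g @ g):
            break
        lr *= beta
    return x - lr * g
```

Because the step size is chosen per iteration from the function values themselves, no global learning-rate tuning is needed.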

P.S. I am not considering having an account on this site, so if you want more discussion, it is better done through email.

user292463
  • Thanks, I haven't worked on it for a year, but will take a closer look in a few days. A student implemented the 1D-subspace case: extending e.g. ADAM with an online parabola model in its single direction for a better choice of step size ( https://arxiv.org/pdf/1907.07063 ), but it wasn't able to beat ADAM so far ... while it controls the sign of the eigenvalue, the eigenvalue being close to 0 indeed turned out to be a big problem - it needs further work, especially testing whether tracking the Hessian on a larger subspace can be essentially better. – Jarek Duda Jul 27 '20 at 09:20
  • Thanks for the reference, I will check. Is there any theoretical guarantee for your new method? – user292463 Jul 27 '20 at 09:50
  • After fixing a direction, estimating the 2nd derivative can be seen as linear regression of 1st derivatives, which can be made online (e.g. by replacing averages with exponential moving averages). For a parabola, such linear regression would give the proper 2nd derivative from just two points... the problem is that the gradients are noisy here, it is not really a parabola, and the modeled direction rotates, e.g. according to some momentum/ADAM (or e.g. online PCA for multiple directions). I would gladly discuss it. – Jarek Duda Jul 27 '20 at 13:34
  • Hi, I looked at your paper. While it has some good heuristics, my concern is that it is complicated and has no theoretical guarantees. Backtracking GD works better than Adam on CIFAR10 (you can check the GitHub link I gave) and it is simple: based on Armijo's condition only. It also has a lot of theoretical guarantees. – user292463 Jul 28 '20 at 07:35
  • Complicated?? Online linear regression of gradients just maintains 4 exponential moving averages: of the position ("x += eta * (newX - x)"), the gradient g (as in momentum), and also of x^2 and gx - can you get a simpler/cheaper 2nd order method? There is a theoretical guarantee of recovering a parabola from derivatives at just two points. The main reason for not beating ADAM might be that I am practically alone with this and have other projects. I will have to look at backtracking GD, but it seems an orthogonal concept which could also be included here. – Jarek Duda Jul 28 '20 at 08:32
  • On being complicated: your update rule is already more complicated than those of Adam and RMSProp, for example, if I read correctly. On theoretical guarantees: I meant, for example, can you prove convergence for your algorithm in some nontrivial cases, e.g. non-(strongly) convex? – user292463 Jul 28 '20 at 10:00
  • The current race is for better 2nd order methods (e.g. saddle-free, K-FAC, TONGA etc.). ADAM is 1st order; linear regression of gradients is 2nd order - additionally suggesting the best local step size, reaching the minimum in one step for a parabola - this part is trivial. There are many ways to use such 2nd order information, so it is not yet time for general proofs. I was alone with this; a student implemented a basic version but it wasn't better. I have moved on to other projects but plan to return to it in the future. – Jarek Duda Jul 28 '20 at 10:49
  • Do you mean better theoretically or experimentally (including implementation)? For theory: the one which you call saddle-free Newton, actually a precise version of it in my linked joint paper, can avoid saddle points and is still fast. Convergence is not guaranteed though. What do you mean by "better", and how can you know/convince that a method is "better" if, as you say, "this is not time for general proofs"? – user292463 Jul 28 '20 at 13:00
  • This is only an alternative way to estimate 2nd order behavior; there is a lot of freedom in building an actual method from such knowledge, starting with a smarter choice of step length for e.g. ADAM in a single direction. Ultimately I would like to build something like SFN: update a promising subspace with e.g. online PCA, and on it update a Hessian model with such online linear regression. For Hessian estimation there is practically only direct backpropagation (very costly) - linear regression of gradients is cheap and allows for an online model. – Jarek Duda Jul 28 '20 at 13:10
  • I have looked at your arXiv paper and see that in negative-eigenvalue directions you flip the sign as in SFN (which is an arbitrary choice). I don't know what you do with the problematic eigenvalue~0 cases ... but most importantly I don't know how you estimate the Hessian, which is the central problem here. Maybe email me if you would like to discuss; we can e.g. make a Zoom call. – Jarek Duda Jul 31 '20 at 07:49
  • Hi, if your Hessian ∇²f(x_n) is non-invertible, then you add terms of the form δ_j ||∇f(x_n)||², where δ_j is appropriately chosen from an ordered set of at least d+1 elements. Here d is the dimension of your space. As we wrote in our paper, we currently don't know how to implement this efficiently at large scale. I don't understand what you mean by "which is an arbitrary choice". – user292463 Aug 03 '20 at 09:40
  • Hi, a non-invertible Hessian means the Newton step to the zero-gradient position would be infinite. What you write sounds like regularization pretending convexity - for a function that in practice is very nonconvex, can you control how bad this approximation is? Your "don't know how to implement" sounds like you are missing the most crucial piece: a Hessian estimator, e.g. in some lower-dimensional subspace. My approach is exactly that - estimating the Hessian as the linear behavior of gradients, from their online linear regression. – Jarek Duda Aug 03 '20 at 10:23
  • Here is one specific example. Assume your space has dimension d=5. Then I can choose δ_j from the set {0,1,2,3,4,5}. Assume your Hessian ∇²f(x_n) at some step is non-invertible. Then you replace the Hessian by A_n = ∇²f(x_n) + δ_j ||∇f(x_n)||² Id. Here you choose δ_j to be the smallest number among {0,1,...,5} such that A_n is invertible (since your dimension is 5 and you have 6 numbers δ_j, at least one of them will work, provided ∇f(x_n) is not zero - in the opposite case x_n is a critical point and you are done). – user292463 Aug 03 '20 at 11:25
  • (Cont. from above) Then you do what you call SFN for A_n, instead of for ∇²f(x_n). A_n is only guaranteed to be invertible; it can still have negative eigenvalues. Yes, we did not do Hessian estimation, and we are looking for ways to do it. Your approach is of course relevant, so I would like to know if there is any theoretical guarantee that you could prove for Hessian estimation. – user292463 Aug 03 '20 at 11:27
  • For practical neural networks the dimension is in the millions (making the Hessian size in the trillions). What you do seems like a regularization technique (as in https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm ), whose approximation is tough to control - especially for the extremely nonconvex functions of real-life problems. And I still don't know: how do you estimate the Hessian? – Jarek Duda Aug 03 '20 at 12:12
  • I don't think that our method is similar to the one you linked to. When you are close to a critical point, our A_n is very close to the Hessian, and we can prove that if it converges then it does so quickly, while for the one you linked things are different. Indeed, I think that our method is very different from all existing methods. Concerning Hessian estimation: as I wrote, we did not do it yet. We only need the precise value of the Hessian in our paper, and the method can work well at small scale, like dimension = 10^4. For large scale, please see the discussion section of our paper. – user292463 Aug 03 '20 at 12:32
  • For convenience: I added the GitHub link for the code of our new Newton's method to my main answer. – user292463 Aug 03 '20 at 12:42
  • Hessian estimation is the central problem here, with lots of issues: the enormous dimension - it is necessary to reduce it (how?), the need to invert it, especially for a noisy estimate, etc. I can help you with that. – Jarek Duda Aug 03 '20 at 12:58
  • Thanks. Yes, Hessian estimation (as good as possible) is one important factor in implementing our method at large scale. I will discuss more with you later via email. – user292463 Aug 03 '20 at 15:48
  • See my slides: https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf - we can discuss through email or e.g. make a Zoom call. – Jarek Duda Aug 03 '20 at 15:56