219

This is something that has been bugging me for a while, and I couldn't find any satisfactory answers online, so here goes:

After reviewing a set of lectures on convex optimization, Newton's method seems to be a far superior algorithm to gradient descent for finding globally optimal solutions, because Newton's method can provide a guarantee for its solution, it's affine invariant, and most of all it converges in far fewer steps. Why are second-order optimization algorithms, such as Newton's method, not as widely used as stochastic gradient descent in machine learning problems?

kjetil b halvorsen
Fei Yang
  • 34
    For neural networks, http://www.deeplearningbook.org/ Section "8.6 Approximate Second-Order Methods" gives a nice overview. In summary "Beyond the challenges created by certain features of the objective function, such as saddle points, the application of Newton’s method for training large neural networks is limited by the significant computational burden it imposes." There exist alternatives that attempt to gain some of the advantages of Newton’s method while side-stepping the computational hurdles, but they have their own issues. – Franck Dernoncourt Dec 29 '16 at 02:56
  • 1
    see this related question and comments, http://stats.stackexchange.com/questions/232305/is-that-true-newtons-method-quasi-newton-method-are-not-widely-used-in-deep-n – Haitao Du Dec 29 '16 at 03:10
  • 1
    Note that the other comments have some wider applicability to machine learning beyond just "deep learning". However while all ML problems can tend to be "big data", not all ML problems are necessarily "big features" (i.e. many parameters to tune), though deep learning invariably is. – GeoMatt22 Dec 29 '16 at 03:49
  • 3
    It's worth noting that in machine learning outside of deep learning, L-BFGS (which, roughly speaking, approximates Newton's method) *is* a fairly common optimization algorithm. – Danica Jan 03 '17 at 22:16
  • 4
    Newton's method assumes convexity, modern ML problems (neutral nets) are not likely anywhere near convex, though admittedly an area of open research there. Hence Newton's method is probably as bad an estimator as linear anywhere but near the point of calculation. You'll probably gain very little for a quadratic increase in computation. That said, a recent conference at Berkeley had a presenter continuing to show progress in using 2nd order methods, so it's not dead by any means. – David Parks Jan 11 '17 at 01:40
  • @davidparks I think most people would say nns have lots of local minima (ie lots of convex sub regions) so second order should get you to a local minimum faster than gradient descent. – seanv507 Jan 11 '17 at 08:54
  • 1
    @seanv507 you're absolutely right about local minima, however the solution space itself of a neural network is very much nonconvex. A convex approximation at point $x$ won't get you to the nearest local minima, it will give you global minimum assuming the entire solution space is convex. That assumption is likely invalidated rather quickly as you move away from the point of calculation, necessitating a modest step size and many recalculations. I concede that a convex approximation is likely to allow for a larger step size than linear, but probably not so large as to justify the quadratic cost. – David Parks Jan 11 '17 at 23:58
  • FYI, one data point that this may be changing a bit: a recent Scikit Learn pull request of a Gradient Boosting Regression Tree - a LightGBM-esq variant - uses Newton's method. https://github.com/scikit-learn/scikit-learn/pull/12807 – jeffhale Feb 20 '19 at 22:18
  • Another related question: https://stats.stackexchange.com/questions/394083/why-second-order-sgd-convergence-methods-are-unpopular-for-deep-learning – Sycorax Apr 22 '19 at 15:43

9 Answers

137

Gradient descent maximizes a function using knowledge of its derivative. Newton's method, a root-finding algorithm, maximizes a function by finding a zero of its first derivative, which requires knowledge of the second derivative. That can be faster when the second derivative is known and easy to compute (the Newton-Raphson algorithm is used in logistic regression). However, the analytic expression for the second derivative is often complicated or intractable, requiring a lot of computation. Numerical methods for computing the second derivative also require a lot of computation -- if $N$ values are required to compute the first derivative, $N^2$ are required for the second derivative.
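
To make that last point concrete, here is a minimal finite-difference sketch, assuming NumPy; `fd_gradient`, `fd_hessian`, and the toy objective are hypothetical names, not part of the answer. Approximating the gradient of a generic $f:\mathbb{R}^N \to \mathbb{R}$ takes $O(N)$ evaluations of $f$, while approximating the Hessian takes $O(N^2)$.

```python
# Illustration only: cost of numerically approximating derivatives of a
# generic objective f: R^N -> R via central differences.
import numpy as np

def fd_gradient(f, x, h=1e-6):
    n = x.size
    g = np.zeros(n)
    for i in range(n):                      # O(N) evaluations of f
        e = np.zeros(n); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def fd_hessian(f, x, h=1e-4):
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):                      # O(N^2) evaluations of f
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

f = lambda x: np.sum(x ** 2) + np.prod(np.sin(x))   # toy objective
x0 = np.ones(3)
print(fd_gradient(f, x0))
print(fd_hessian(f, x0))
```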

jwimberley
  • 7
    Worth noting that (things based on) the [Gauss-Newton](https://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm) method are probably more common. This is a specialization of Newton to nonlinear least squares. – GeoMatt22 Dec 29 '16 at 01:26
  • 4
    I wouldn't call Gauss-Newton a specialization of Newton to nonlinear least squares. I'd call it a bastardized approximation of Newton for nonlinear least squares, which uses a more inaccurate Hessian approximation, the larger the residuals in the fitted equations, and accordingly, the further the argument is from optimality. – Mark L. Stone Dec 29 '16 at 04:12
  • 1
    @MarkL.Stone fair point, I was trying to not get into technicalities :) It is true that Gauss-Newton style methods try to "fake" 2nd order w/only 1st order info. Personally I have never used Newton methods for optimization, just Gauss-Newton (or LM, or ~similar UKF) or DFO-SQP methods (e.g. [BOBYQA](https://en.wikipedia.org/wiki/BOBYQA)). "Optimality" is a tricky question I would say ... for a ML problem, vs. say an engineering design-optimization problem, the reliability/informativeness of a "local Hessian" can be dubious. Perhaps non-local DFO-SQP is ~"stochastic Newton"? (e.g. "online") – GeoMatt22 Dec 29 '16 at 05:26
  • 1
    On second thought, DFO-SQP approaches tend to be nonlocal in the *parameter* space, rather than data batches. The [UKF](https://en.wikipedia.org/wiki/Kalman_filter#Unscented_Kalman_filter) may be the closest in flavor to "stochastic Newton" as it is online w/limited memory ... but it effectively assumes a positive-definite Hessian (i.e. Gaussian approx.). – GeoMatt22 Dec 29 '16 at 05:34
  • 2
    Actually that is a misleading reason, since there are second-order methods like CG that don't require computing the Hessian. k iterations of CG will cost only kN. It is correct that CG would theoretically match Newton only at k=N, but really you don't need so many iterations. – user25322 Oct 06 '17 at 14:22
  • @jwimberley I'm sorry, I don't get it. https://en.wikipedia.org/wiki/Newton%27s_method according to this, I don't see why the second derivative is required... can you explain please? – Gulzar Nov 09 '18 at 15:05
  • 1
    @Gulzar That wikipedia page is talking about using Newton's method to find the root of a function, where it equals zero. Maximizing or minimizing a function is usually to finding a root of its derivative (exceptions come from solutions lying on boundaries of allowed regions). So, using Newton's method for optimization requires the derivative of the derivative, i.e. second derivative. – jwimberley Nov 09 '18 at 16:05
  • [This paper from Caplan (2011)](http://web.mit.edu/pcaplan/www/SecondDerivative2012.pdf) seems to explain a method to calculate the second derivative (Jacobian matrix) in N operations instead of N^2 – LMB Jul 12 '19 at 08:26
77

More people should be using Newton's method in machine learning*. I say this as someone with a background in numerical optimization, who has dabbled in machine learning over the past couple of years.

The drawbacks in answers here (and even in the literature) are not an issue if you use Newton's method correctly. Moreover, the drawbacks that do matter also slow down gradient descent by the same amount or more, but through less obvious mechanisms.

  • Using line search with the Wolfe conditions or using trust regions prevents convergence to saddle points. A proper gradient descent implementation should be doing this too. The paper referenced in Cam.Davidson.Pilon's answer points out problems with "Newton's method" in the presence of saddle points, but the fix they advocate is also a Newton method.

  • Using Newton's method does not require constructing the whole (dense) Hessian; you can apply the inverse of the Hessian to a vector with iterative methods that only use matrix-vector products (e.g., Krylov methods like conjugate gradient). See, for example, the CG-Steihaug trust region method.

  • You can compute Hessian matrix-vector products efficiently by solving two higher order adjoint equations of the same form as the adjoint equation that is already used to compute the gradient (e.g., the work of two backpropagation steps in neural network training; see the sketch after this list).

  • Ill conditioning slows the convergence of iterative linear solvers, but it also slows gradient descent equally or worse. Using Newton's method instead of gradient descent shifts the difficulty from the nonlinear optimization stage (where not much can be done to improve the situation) to the linear algebra stage (where we can attack it with the entire arsenal of numerical linear algebra preconditioning techniques).

  • Also, the computation shifts from "many many cheap steps" to "a few costly steps", opening up more opportunities for parallelism at the sub-step (linear algebra) level.
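
Since the bullets about Krylov solvers and Hessian-vector products carry most of the argument, here is a minimal matrix-free sketch under stated assumptions: PyTorch for the two backward passes, and SciPy's plain CG standing in for the CG-Steihaug trust-region solver mentioned above; `newton_cg_step` and the toy objective are hypothetical names, not part of the answer.

```python
# Hedged sketch of an inexact Newton step: solve H p = -g with CG using only
# Hessian-vector products, which come from a second backward pass.
import numpy as np
import torch
from scipy.sparse.linalg import LinearOperator, cg

def newton_cg_step(loss_fn, x):
    x = x.detach().requires_grad_(True)
    loss = loss_fn(x)
    (g,) = torch.autograd.grad(loss, x, create_graph=True)   # 1st backward pass

    def hvp(v_np):                                            # H @ v, matrix-free
        v = torch.as_tensor(np.asarray(v_np).ravel(), dtype=x.dtype)
        (hv,) = torch.autograd.grad(g, x, grad_outputs=v, retain_graph=True)
        return hv.detach().numpy()                            # 2nd backward pass

    n = x.numel()
    H = LinearOperator((n, n), matvec=hvp, dtype=np.float64)
    # Plain CG assumes positive definiteness; a real implementation would use a
    # Steihaug-style trust-region CG and a preconditioner, as discussed above.
    p, _ = cg(H, -g.detach().numpy(), maxiter=50)
    return torch.as_tensor(p, dtype=x.dtype)

# Toy usage on a smooth test function; the resulting step would then be
# safeguarded by a Wolfe line search or trust region before being taken.
f = lambda z: (z ** 4).sum() + (z ** 2).sum() - z[0] * z[1]
x0 = torch.randn(4, dtype=torch.float64)
print(newton_cg_step(f, x0))
```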

For background information about these concepts, I recommend the book "Numerical Optimization" by Nocedal and Wright.

*Of course, Newton's method will not help you with L1 or other similar compressed sensing/sparsity promoting penalty functions, since they lack the required smoothness.

Nick Alger
  • 3
    I think we're in violent agreement with each other, not with everyone else. – Mark L. Stone Dec 30 '16 at 14:43
  • 1
    Tell me what you think of this paper http://link.springer.com/article/10.1007/s11081-016-9323-4 ? Line search Newton method applied to very non-convex problems w/no adjustment to Hessian or searching along directions of negative curvature. No wonder he thinks Quasi-Newton (probably BFGS) is more robust than Newton. The author claims he used a simple version of Newton's method so as to have an apples to apples comparison between Euclidean and Riemannian Newton's method. (continued in next comment) – Mark L. Stone Dec 30 '16 at 14:52
  • 1
    That's like comparing whether UK or USA produces better research mathematicians by comparing the mathematical abilities of 26 year old drug addict high school dropouts, rather than by comparing the top echelon of mathematics graduate students coming out of each country's finest schools. The paper is signed, sealed, and delivered, no one, and I mean no one's changing it or withdrawing it now. Incroyable. – Mark L. Stone Dec 30 '16 at 14:52
  • 4
    @MarkL.Stone It seems a conversation happened here and was deleted while I was away. Anyways, I think you're right that we agree with each other and no one else. I guess this is to be expected based on our background as compared to the other people here. As you probably expect I don't think very much of the linked paper. On the other hand, I do think that Riemannian manifold *Newton's method*, where one shoots a geodesic trajectory in a Newton search direction, is a technique with a lot of promise for very hard problems. – Nick Alger Dec 30 '16 at 22:15
  • 2
    Could you provide links to the more technical terms you use by chance? (Wolfe conditions, CG-Steihaug trust region method, conjugate gradient, adjoint equation, ...). It seems like you know a lot, but I can't understand it and I wish I could. – Pro Q Jan 02 '17 at 07:52
  • 2
    How would you deal with a large training set? If you have e.g. 1 million training samples, then just evaluating the current optimization objective requires testing 1 million samples. And you need to do that multiple times during a line search. So by the time you've done 1 Newton step, Stochastic Gradient Descent will have done a few million updates. – nikie Jan 02 '17 at 09:20
  • @nikie There are stochastic variants of Newton methods; see e.g. [these slides](http://www.di.ens.fr/~aspremon/Houches/talks/Goldfarb.pdf) for a brief overview. (This is not something I know much about, though.) – Danica Jan 03 '17 at 21:54
  • 2
    Nick and @MarkL.Stone: Are you talking about essentially [this approach](http://www.cs.toronto.edu/~jmartens/docs/Deep_HessianFree.pdf)? This is something that was briefly popular in deep learning, especially for recurrent nets, but has since fallen out of favor I assume because it just didn't empirically work that much better than adaptive gradient methods. If they were just doing something wrong, and you fix whatever it is and show it generally outperforms the current standard SGD variant Adam, you might make a big impact: the Adam paper has had 1345 citations in two years.... – Danica Jan 03 '17 at 22:14
  • 1
    @ProQ the best intro to all this is "Numerical optimization" Nocedal and Wright. nikie: Stochastic gradient descent is OK, but the iterates only converge in a statistical sense and not to high tolerance, so it is comparing apples to oranges. There are ways to randomize Newton's method as well; this is done routinely in seismic imaging, which has a similar high data problem. Dougal: That is basically what I'm talking about, but with 2 major oversights - they compute the Hessian action with finite differences rather than adjoints, and don't use a preconditioner for CG. – Nick Alger Jan 05 '17 at 07:39
  • 1
    @NickAlger Interesting post! It's nice to see a dissenting opinion like this. One frequent complaint that I don't see addressed is that stochastic gradients (and Hessian-vector approximations), which are used out of necessity on some modern ML problems, pose a problem to second-order methods. [Goodfellow's optimization chapter](http://www.deeplearningbook.org/) claims you need to take more samples because of Hessian sensitivity, and the [Adam](https://arxiv.org/abs/1412.6980) paper also mentions the same off-hand in the intro. – VF1 Jun 23 '17 at 18:08
  • 1
    I would really like to see these methods more heavily supported in Tensorflow (hint hint). L-BFGS [quasi-newton] is available only through the scipy optimizer interface. But [empirically speaking] it beats the ADAM optimizer (arguably best grad descent optimizer out there now) drastically in both convergence time and accuracy on every problem I've tried it on (even if each individual step is more expensive).. – JPJ Aug 04 '18 at 05:33
  • 1
    If quasi-newtonian methods could be devised to run on minibatches (and not require the whole dataset) I think it would stimulate another leap in accuracy on benchmarks (and ML in general) similar to what batch norm and resnets have done. – JPJ Aug 04 '18 at 05:39
  • Always meant to ask: what are some good open-source implementations of a quasi-Newton method using some of these mechanisms (trust regions, smarter matrix-vector calculations)? Which of these does L-BFGS use? – jwimberley Apr 18 '19 at 12:38
  • 2
    "I recommend the book "Numerical Optimization" by Nocedal and Wright." Ah, the irony, given that [Nocedal recommends SGD](https://arxiv.org/abs/1606.04838) over traditional Newton-like methods for large scale ML! – Cliff AB Mar 26 '20 at 23:17
  • @CliffAB Nocedal is also a co-author on a more recent paper which does advocate a Newton-type method for machine learning: Bollapragada, Raghu, Richard H. Byrd, and Jorge Nocedal. "Exact and inexact subsampled Newton methods for optimization." IMA Journal of Numerical Analysis 39.2 (2019): 545-578. – Nick Alger Apr 01 '20 at 21:45
  • Interesting. However that paper doesn't actually advocate for such methods over an SGD-like approach. It's my understanding that it's more of an open question at this time. In general, SGD approaches are very well established, reliable and fast at this point, while it's my understanding that stochastic second order methods are still being investigated. And non-stochastic methods are pretty much dead for such problems. – Cliff AB Apr 01 '20 at 22:44
62

A combination of two reasons:

  • Newton's method is attracted to saddle points;
  • saddle points are common in machine learning, or in fact in any multivariable optimization.

Look at the function $$f=x^2-y^2$$ [plot of this saddle surface omitted]

If you apply multivariate Newton method, you get the following. $$\mathbf{x}_{n+1} = \mathbf{x}_n - [\mathbf{H}f(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n)$$

Let's get the Hessian: $$\mathbf{H}= \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\[2.2ex] \dfrac{\partial^2 f}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\,\partial x_n} \\[2.2ex] \vdots & \vdots & \ddots & \vdots \\[2.2ex] \dfrac{\partial^2 f}{\partial x_n\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}.$$

$$\mathbf{H}= \begin{bmatrix} 2 & 0 \\[2.2ex] 0 & -2 \end{bmatrix}$$

Invert it: $$[\mathbf{H} f]^{-1}= \begin{bmatrix} 1/2 & 0 \\[2.2ex] 0 & -1/2 \end{bmatrix}$$

Get the gradient: $$\nabla f=\begin{bmatrix} 2x \\[2.2ex] -2y \end{bmatrix}$$

Get the final equation: $$\mathbf{\begin{bmatrix} x \\[2.2ex] y \end{bmatrix}}_{n+1} = \begin{bmatrix} x \\[2.2ex] y \end{bmatrix}_n -\begin{bmatrix} 1/2 & 0 \\[2.2ex] 0 & -1/2 \end{bmatrix} \begin{bmatrix} 2x_n \\[2.2ex] -2y_n \end{bmatrix}= \mathbf{\begin{bmatrix} x \\[2.2ex] y \end{bmatrix}}_n - \begin{bmatrix} x \\[2.2ex] y \end{bmatrix}_n = \begin{bmatrix} 0 \\[2.2ex] 0 \end{bmatrix} $$

So, you see how the Newton method led you to the saddle point at $x=0,y=0$.

In contrast, the gradient descent method will not lead to the saddle point. The gradient is zero at the saddle point, but a tiny step out would pull the optimization away, as you can see from the gradient above: its $y$-component is $-2y$, so a descent step pushes $y$ further from zero.
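
A small numerical check of this example (a sketch assuming NumPy; the starting point and step size are arbitrary choices): one Newton step lands exactly on the saddle, while gradient descent started slightly off the axis drifts away from it.

```python
import numpy as np

grad = lambda p: np.array([2 * p[0], -2 * p[1]])    # gradient of f = x^2 - y^2
H_inv = np.array([[0.5, 0.0], [0.0, -0.5]])         # inverse Hessian of f

p = np.array([1.0, 0.3])
print("Newton step:", p - H_inv @ grad(p))           # -> [0. 0.], the saddle

p = np.array([1.0, 0.3])
for _ in range(50):
    p = p - 0.1 * grad(p)                            # plain gradient descent
print("gradient descent:", p)                        # x -> 0, but |y| keeps growing
```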

kjetil b halvorsen
Aksakal
  • 1
    Thanks to you I actually understood how this method works from A to Z, so thank you very much for this clear example! – greenoldman Sep 19 '17 at 18:02
  • 1
    What would be the favourite point here? – Ben Aug 12 '19 at 11:48
  • this is a great response, would love to see how this expands to gradient descent – Josh Oct 04 '20 at 23:58
  • There are approaches which can handle this issue e.g. replacing attraction with repulsion in these directions (of negative curvature), like saddle-free Newton: https://stats.stackexchange.com/questions/404217/saddle-free-newton-method-for-sgd-while-newton-attracts-saddles-is-it-worth-t – Jarek Duda Dec 06 '20 at 07:49
  • @JarekDuda the question is about Newton's method. What you call SFN is a different thing. In fact it's a misnomer. It's not a Newton method anymore, nor its extension or generalization, just a roughing up of the Hessian in an arbitrary way – Aksakal Jan 23 '21 at 22:10
  • "In contrast, the gradient descent method will not lead to the saddle point." This isn't an accurate statement, if you start from $y=0$ in your example, then you will in fact converge to the saddle point. It is also easy to imagine examples in two variables where the attraction to the saddle point for gradient descent is more extreme. For more information, you can check out methods such as perturbed gradient descent, where you'll find discussions for just how bad gradient descent can be against saddle points. – Simply Beautiful Art Apr 14 '21 at 17:32
  • 1
    @SimplyBeautifulArt the next sentence clarifies the statement. The saddle point is not stable in gradient descent, any tiny step will push you away from it – Aksakal Apr 14 '21 at 18:00
  • 1
    Let me be more precise with my previous statement. Although it is the case that gradient should *eventually* leave a first order saddle point, it may have to get very very close to it to do so. [Gradient Descent Can Take Exponential Time to Escape Saddle Points](https://papers.nips.cc/paper/6707-gradient-descent-can-take-exponential-time-to-escape-saddle-points.pdf) figure 1 provides a good example. – Simply Beautiful Art Apr 14 '21 at 18:43
  • 1
    Although the saddle point is unstable for gradient descent, it's also worth noting that because the gradient is zero at the saddle point, gradient descent isn't leaving quickly. In practice, this can lead to "convergence" to the saddle point, if one is not careful. – Simply Beautiful Art Apr 14 '21 at 18:44
  • @SimplyBeautifulArt, ok, i agree with both your late comments. – Aksakal Apr 15 '21 at 13:49
39

I recently learned this myself - the problem is the proliferation of saddle points in high-dimensional space, which Newton methods want to converge to. See this article: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.

Indeed the ratio of the number of saddle points to local minima increases exponentially with the dimensionality N.

While gradient descent dynamics are repelled away from a saddle point to lower error by following directions of negative curvature, ...the Newton method does not treat saddle points appropriately; as argued below, saddle-points instead become attractive under the Newton dynamics.
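
The fix the paper proposes, rescaling the Newton step by the absolute values of the Hessian's eigenvalues so that negative-curvature directions repel instead of attract, can be sketched as follows; this assumes NumPy and a small dense Hessian purely for illustration, and the helper name is hypothetical.

```python
# Hedged sketch of the |H|-rescaled ("saddle-free") Newton step from the paper.
import numpy as np

def saddle_free_newton_step(grad, hess, eps=1e-6):
    w, V = np.linalg.eigh(hess)              # eigen-decomposition of the Hessian
    w_abs = np.maximum(np.abs(w), eps)       # take |eigenvalues|, floored at eps
    return -V @ ((V.T @ grad) / w_abs)       # step = -|H|^{-1} g

# On f = x^2 - y^2 at (1, 0.3): plain Newton jumps to the saddle (0, 0),
# while the |H| step shrinks x toward 0 but pushes y away from the saddle.
g = np.array([2 * 1.0, -2 * 0.3])
H = np.diag([2.0, -2.0])
print(saddle_free_newton_step(g, H))         # -> [-1.  0.3]
```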

Cam.Davidson.Pilon
  • 3
    Could you add some explanation to why this is so? In theory, Newton's method performs a weighted gradient descent with "optimal" weights for each of the eigenvectors. – nbubis Dec 29 '16 at 17:15
  • 4
    What that article says about Newton methods "wanting to" converge to saddle points is only true for garbage implementations of Newton's method. – Mark L. Stone Dec 30 '16 at 04:16
  • 1
    The paper reparameterizes the problem in terms of eigenvalues and eigenvectors, and uses that to show that gradient descent moves away from a saddle point: it moves towards the saddle point in the direction of negative e-vectors, but it moves away in the direction of positive e-vectors, so it ultimately leaves the saddle point. Newton, on the other hand, has no such guarantee. – Elizabeth Santorella Jan 03 '17 at 20:18
  • The new algorithm they advocate for in this paper is (a variant of) Newton's method though: it is basically Newton's method for the directions of positive curvature and negative Newton's method for the directions of negative curvature. – Nick Alger Jan 05 '17 at 07:44
  • Would you mind telling me what you mean by dynamics? I searched a lot but did not find anything. – Green Falcon Sep 25 '20 at 05:25
  • 1
    The linked paper introduces saddle-free Newton methods able to handle this issue: by changing sign of step in negative curvature directions - this problem can be resolved by controlling Hessian eigenvalues. – Jarek Duda Dec 06 '20 at 07:56
23

You asked two questions: Why don't more people use Newton's method, and why do so many people use stochastic gradient descent? These questions have different answers, because there are many algorithms that lessen the computational burden of Newton's method but often work better than SGD.

First: Newton's Method takes a long time per iteration and is memory-intensive. As jwimberley points out, Newton's Method requires computing the second derivative, $H$, which is $O(N^2)$, where $N$ is the number of features, while computing the gradient, $g$, is only $O(N)$. But the next step is $H^{-1} g$, which is $O(N^3)$ to compute. So while computing the Hessian is expensive, inverting it or solving least squares is often even worse. (If you have sparse features, the asymptotics look better, but other methods also perform better, so sparsity doesn't make Newton relatively more appealing.)

Second, many methods, not just gradient descent, are used more often than Newton; they are often knockoffs of Newton's method, in the sense that they approximate a Newton step at a lower computational cost per step but take more iterations to converge. Some examples:

  • Because of the expense of inverting the Hessian, "quasi-Newton" methods like BFGS approximate the inverse Hessian, $H^{-1}$, by looking at how the gradient has changed over the last few steps.

  • BFGS is still very memory-intensive in high-dimensional settings because it requires storing the entire $O(N^2)$ approximate inverse Hessian. Limited memory BFGS (L-BFGS) calculates the next step direction as the approximate inverse Hessian times the gradient, but it only requires storing the last several gradient updates; it doesn't explicitly store the approximate inverse Hessian.

  • When you don't want to deal with approximating second derivatives at all, gradient descent is appealing because it uses only first-order information. Gradient descent is implicitly approximating the inverse Hessian as the learning rate times the identity matrix. I, personally, rarely use gradient descent: L-BFGS is just as easy to implement, since it only requires specifying the objective function and gradient (see the sketch after this list); it has a better inverse Hessian approximation than gradient descent; and gradient descent requires tuning the learning rate.

  • Sometimes you have a very large number of observations (data points), but you could learn almost as well from a smaller number of observations. When that is the case, you can use "batch methods", like stochastic gradient descent, that cycle through using subsets of the observations.
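
To illustrate the point that L-BFGS needs only an objective and a gradient, here is a minimal sketch assuming SciPy; the toy logistic-regression data and the helper names are hypothetical, not part of the answer.

```python
# Minimal L-BFGS usage: supply the objective and gradient, no learning rate.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = (X @ w_true + rng.normal(size=200) > 0).astype(float)

def nll(w):                       # logistic-regression negative log-likelihood
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(w):                      # its gradient: X^T (sigmoid(Xw) - y)
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y)

res = minimize(nll, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
print(res.x, res.nit)
```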

  • 1
    (+1) It's worth noting that L-BFGS is of the same order of complexity as gradient descent in regards to the number of parameters. This is not the case for BFGS. So it's not *just* the limited memory part of L-BFGS that makes it attractive. – Cliff AB Mar 16 '19 at 18:18
  • Sure using full Hessian is rather impractical for NN training, but we can e.g. restrict to "locally interesting subspace" - like spanned by top 10 Hessian eigenvectors suggested by https://arxiv.org/pdf/1812.04754 ... which can be practically traced with methods like online PCA ( http://cs-www.cs.yale.edu/homes/el327/papers/opca.pdf ). – Jarek Duda Dec 06 '20 at 08:01
13

The gradient descent direction is cheaper to calculate, and performing a line search in that direction is a more reliable, steady source of progress toward an optimum. In short, gradient descent is relatively reliable.

Newton's method is relatively expensive in that you need to calculate the Hessian on the first iteration. Then, on each subsequent iteration, you can either fully recalculate the Hessian (as in Newton's method) or merely "update" the prior iteration's Hessian (in quasi-Newton methods) which is cheaper but less robust.

In the extreme case of a very well-behaved function, especially a perfectly quadratic function, Newton's method is the clear winner. If it's perfectly quadratic, Newton's method will converge in a single iteration.

In the opposite extreme case of a very poorly behaved function, gradient descent will tend to win out. It'll pick a search direction, search down that direction, and ultimately take a small-but-productive step. By contrast, Newton's method will tend to fail in these cases, especially if you try to use the quasi-Newton approximations.

In between gradient descent and Newton's method, there are methods like the Levenberg–Marquardt algorithm (LMA), though I've seen the names confused a bit. The gist is to use a more gradient-descent-informed search when things are chaotic and confusing, then switch to a more Newton-method-informed search when things are getting more linear and reliable.
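
The interpolation LMA performs can be sketched in a few lines; note this uses the generic damped-Newton form $(H+\lambda I)p=-g$ rather than the least-squares-specific Gauss-Newton Hessian LMA normally uses, and the Rosenbrock test point is just an illustration, assuming NumPy.

```python
# As the damping lam grows, the step shades from a Newton step toward a short
# gradient-descent-like step.
import numpy as np

def damped_newton_step(g, H, lam):
    return np.linalg.solve(H + lam * np.eye(g.size), -g)

# Gradient and Hessian of the Rosenbrock function at x = (-1.2, 1.0).
x = np.array([-1.2, 1.0])
g = np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
              200 * (x[1] - x[0] ** 2)])
H = np.array([[2 - 400 * (x[1] - 3 * x[0] ** 2), -400 * x[0]],
              [-400 * x[0], 200.0]])

for lam in (0.0, 1.0, 100.0):     # Newton-like  ->  gradient-descent-like
    print(lam, damped_newton_step(g, H, lam))
```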

Nat
  • 4
    Boy, you must use terrible implementations of Newton and Quasi-Newton. If using either with a non positive definite Hessian, then either use trust regions or do line search along direction(s) of negative curvature. If so, they are MORE reliable than steepest descent (i.e., gradient descent with line search or trust region). In short, gradient descent is much less reliable than a properly implemented Quasi-Newton method, which is less reliable than a properly implemented Newton method. Computation time and memory requirements per iteration are a different matter however. – Mark L. Stone Dec 29 '16 at 12:41
  • 4
    I think you mean perfectly quadratic function. That is, Newton's method converges in a single iteration with a quadratic objective function, which has a linear gradient. – Elizabeth Santorella Dec 29 '16 at 22:22
  • @MarkL.Stone: By definition, gradient descent goes in the optimal direction for an infinitely small step. By contrast, Newton's method uses second-order effects to project a zero, attempting to get a more direct route assuming that you'll make a more significant step. For a very poorly behaved function where step sizes are arbitrarily small, gradient descent produces the optimal search direction. If your implementation of Newton's method gives you the optimal search direction for tiny steps, it's not Newton's method - it's steepest descent, by definition. – Nat Dec 30 '16 at 03:29
  • 1
    @ElizabethSantorella: Yup, you're right! I updated the answer. – Nat Dec 30 '16 at 03:38
  • @Chemical Engineer In your last sentence, you have a very strange definition. In the limit, if you restrict algorithms to take a zero length step, then steepest descent is in a multi-way tie for being optimal; otherwise, not. Here's a way to think about steepest descent: it is Newton's method, which is based on a quadratic expansion, except that instead of using the true Hessian in the 2nd order term, it uses the approximation that the Hessian equals the Identity matrix. This decreases fidelity of the quadratic model, and makes any non-zero step based on it inferior unless true Hessian = Identity – Mark L. Stone Dec 30 '16 at 03:59
  • 3
    The advantage of a well-implemented and safeguarded Newton method over steepest descent increases the nastier, more ill-condtioned, more non-convex the function is. If you are minimizing the best-behaved quadratic function there is, having a $1/2 x^Tx$ quadratic term, i.e., Hessian = Identity matrix, then steepest descent is just fine, and is the same as Newton's method. – Mark L. Stone Dec 30 '16 at 04:06
  • @MarkL.Stone: I disagree with the interpretation of steepest descent having an assumed Hessian of the identity matrix. I mean, this would be true **if** you used the steepest descent method to calculate a step distance in addition to the step direction, in which case Newton's is clearly superior (except for cost). But, at the limit of an infinitely small step, Newton's and steepest descent make numerically different directional choices. Newton's is better if you're projecting the zero, while the simple gradient is the direction of greatest change at the point. – Nat Dec 30 '16 at 04:07
  • If no safeguard, such as trust regions or line search is used, then steepest descent and Newton's method are both garbage. Steepest descent without safeguards is more commonly known nowadays as gradient descent, which contrary to the name, need not descend, even for a convex objective function. – Mark L. Stone Dec 30 '16 at 04:11
  • @MarkL.Stone: Gradient descent implies a line search, right? Since gradient descent fundamentally doesn't provide a step (unless you calculate a step size using Newton's method, interpreting the Hessian as identity - but that's not steepest descent, that's just a bastardized Newton's method), line searching is the default approach to gradient descent. Or, if you choose an assumed step size, that's really more of a pattern search algorithm, which is a different animal. – Nat Dec 30 '16 at 04:18
  • Steepest descent, gradient descent, whatever you want to call them, are based on a quadratic model of the objective (or more generally, the Lagrangian in the case of nonlinear constraints) in which the Hessian is approximated as being the Identity matrix. That said, it is malpractice of high order to use an unsafeguarded Newton method on a non positive semi-definite Hessian which has not been adjusted to be positive semi-definite. Unbelievably, some of that garbage even occasionally gets published http://link.springer.com/article/10.1007/s11081-016-9323-4 in nominally reputable journals – Mark L. Stone Dec 30 '16 at 04:24
  • I've never heard your definition of steepest descent (or gradient descent; I'm using them interchangeably too), and honestly I can't agree with it. The conceptual basis for steepest descent is that, from calculus, we know that the direction of greatest change at a point is the gradient at that point. So, we do a line search down that direction, then iteratively repeat. To say that this is an approximation of Newton's with a Hessian of identity suggests that we should also calculate a step size, which is what Newton's does. But that doesn't follow. – Nat Dec 30 '16 at 04:28
  • 2
    I've made my case. if you want to think steepest descent, gradient descent are wonderful, especially on poorly behaved functions, that's your business. Knock yourself out. – Mark L. Stone Dec 30 '16 at 04:31
  • Mark L. Stone is correct here; The continuous Newton path is usually a better, more direct path than the continuous gradient descent path. Consider walking up an ellipsoidal mountain with one ellipsoid direction very steep and the other very wide. If you do gradient ascent, you will first walk up the steepest direction to roughly the "ridge", then walk along the ridge to the peak in a dog-leg pattern. With steps in the Newton direction, you walk in a straight line (in the x-y plane) from start to peak. This "ridge" problem is fundamental and gets worse for gradient ascent in higher dimensions. – Nick Alger Dec 30 '16 at 13:16
8

For large dimensions, the Hessian is typically expensive to store and solving $Hd = g$ for a direction can be expensive. It is also more difficult to parallelise.

Newton's method works well when close to a solution, or if the Hessian is slowly varying, but needs some tricks to deal with lack of convergence and lack of definiteness.

Often an improvement is sought, rather than an exact solution, in which case the extra cost of Newton or Newton like methods is not justified.

There are various ways of ameliorating the above such as variable metric or trust region methods.

As a side note, in many problems a key issue is scaling and the Hessian provides excellent scaling information, albeit at a cost. If one can approximate the Hessian, it can often improve performance considerably. To some extent, Newton's method provides the 'best' scaling in that it is affine invariant.
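
A small numerical illustration of the scaling point (a sketch assuming NumPy; the quadratic and step size are arbitrary choices): on a badly scaled quadratic, one Newton step is exact regardless of the scaling, while the usable gradient descent step size is dictated by the largest curvature, so the flat direction crawls.

```python
import numpy as np

A = np.diag([1.0, 1000.0])          # ill-conditioned quadratic f(x) = 0.5 x^T A x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
print("Newton:", x - np.linalg.solve(A, grad(x)))   # exact minimizer [0, 0]

x = np.array([1.0, 1.0])
for _ in range(100):
    x = x - 0.0019 * grad(x)        # step must stay below 2/1000 or GD diverges
print("GD, 100 steps:", x)          # the low-curvature coordinate has barely moved
```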

copper.hat
2

There are many difficulties regarding the use of Newton's method for SGD, especially:

  • it requires knowing the local Hessian matrix - how do we estimate the Hessian, e.g. from noisy gradients, with sufficient precision at a reasonable cost?

  • the full Hessian is too costly - we rather need some restriction of it, e.g. to a linear subspace (like its top eigenspace),

  • it needs the inverted Hessian $H^{-1}$, which is costly and very unstable for noisy estimation - eigenvalues statistically blurred around $\lambda=0$ invert to infinity,

  • Newton's method is directly attracted to the nearest point with zero gradient ... which here is usually a saddle. How do we avoid this saddle attraction, e.g. repel from saddles instead? For example, saddle-free Newton reverses the step in negative-curvature directions, but this requires controlling the signs of the eigenvalues,

  • it would be good to do it online - instead of performing a lot of computation at a single point, try to split it into many small steps to exploit local information about the landscape.

We can go from 1st order to 2nd order in small steps: e.g., by adding updates of just 3 averages to the momentum method, we can simultaneously MSE-fit a parabola in its direction for a smarter choice of step size (a generic sketch of this idea appears below the figure link).

https://i.stack.imgur.com/ai1ks.png
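
The following is only a generic illustration of the "MSE-fit a parabola in its direction" idea: a least-squares line fit of the directional derivative, done here with `np.polyfit` on a few stored points rather than with the running averages the answer describes, so it is not the author's exact algorithm.

```python
# Fit g(theta) ~ a + b*theta along a fixed direction; the modeled parabola has
# its minimum where the fitted line crosses zero, theta* = -a/b.
import numpy as np

def parabola_min_1d(thetas, gs):
    b, a = np.polyfit(thetas, gs, 1)      # slope b ~ curvature, intercept a
    return -a / b if b > 0 else None      # only trust positive curvature

# Toy test: along this direction f(theta) = 2*(theta - 3)^2, so g = 4*(theta - 3).
thetas = np.array([0.0, 0.5, 1.0, 1.5])
gs = 4 * (thetas - 3.0) + 0.1 * np.random.default_rng(0).normal(size=4)  # noisy
print(parabola_min_1d(thetas, gs))        # ~ 3.0, the estimated 1-D minimum
```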


ps. I have prepared an SGD overview lecture focused on 2nd order methods: slides: https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf, video: https://youtu.be/ZSnYtPINcug

Jarek Duda
2

Just some comments:

  1. First-order methods have very good theoretical guarantees about convergence and avoidance of saddle points; see Backtracking GD and its modifications.
  2. Backtracking GD can be implemented in DNNs, with very good performance.
  3. Backtracking GD allows big learning rates, which can be on the order of the inverse of the gradient's size when the gradient is small. This is very handy when you converge to a degenerate critical point (a minimal backtracking sketch appears after this list).
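
For readers unfamiliar with the term, here is a minimal backtracking (Armijo) gradient descent sketch, assuming NumPy; the constants and the toy quadratic are arbitrary choices, not taken from the cited papers.

```python
# The step size is found each iteration by halving until a sufficient-decrease
# (Armijo) condition holds, so no global learning rate has to be tuned.
import numpy as np

def backtracking_gd(f, grad, x, iters=100, alpha0=1.0, beta=0.5, c=1e-4):
    for _ in range(iters):
        g = grad(x)
        alpha = alpha0
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):   # Armijo test
            alpha *= beta
        x = x - alpha * g
    return x

f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
print(backtracking_gd(f, grad, np.array([5.0, 5.0])))          # ~ [1, -2]
```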

References:

https://github.com/hank-nguyen/MBT-optimizer

https://arxiv.org/abs/2007.03618 (Here you also find a heuristic argument that backtracking GD has the correct units, in the sense of Zeiler's Adadelta paper)

Concerning Newton's method: with a correct modification, you can avoid saddle points, as several previous comments pointed out. Here is a rigorous proof, where we also give a simple way to proceed if the Hessian is singular:

https://arxiv.org/abs/2006.01512

Github link for the codes:

https://github.com/hphuongdhsp/Q-Newton-method

Remaining issues: cost of implementation and no guarantee of convergence.

Addendum:

  1. The paper of Caplan mentioned by LMB: I took a quick look. I don't think that paper presented any algorithm which computes the Hessian in $O(N)$. It only says that you can compute the Hessian with only N "function evaluations" - I don't know yet what that precisely means - and the final complexity is still $O(N^2)$. It also did some experiments and says that the usual Newton's method works better than (L-)BFGS for those experiments.

  2. (related to the previous sentence) I should add this as comments to JPJ and Elizabeth Santorella but cannot (not enough points), so I write it here: since you two mentioned BFGS and L-BFGS, can you give a link to source code for these for DNNs (for example for the datasets MNIST, CIFAR10, CIFAR100), with reported experimental results, so people can compare with first-order methods (variants of GD, including Backtracking GD), to have an impression of how good they are at large scale?

Tuyen Truong, UiO

user292463