Intuition Behind Accelerated First Order Methods

Question

$\newcommand{\prox}{\operatorname{prox}}$ $\newcommand{\argmin}{\operatorname{argmin}}$

Suppose that we want to solve the following convex optimization problem:

$\min_{x \in \mathbb{R}^n} g(x) + h(x)$

where we assume that $g$ is convex and differentiable and $h$ is convex (I am trying to be as non-specific as possible here). Recall that generalized gradient descent (the proximal gradient method) can be formulated as follows:

Step $0$: choose an initial point $x^{(0)} \in \mathbb{R}^n$

Step $k$ ($k \ge 1$): $x^{(k)} = \prox_h (x^{(k-1)} - t_k \nabla g(x^{(k-1)}), t_k)$

where $\prox$ is a proximal operator defined as $\prox_h(y,t) := \argmin \limits_{x \in \mathbb{R}^n} h(x) + \frac{1}{2t} \|y-x\|^2$

It is known that if $\nabla g$ is Lipschitz continuous and the proximal operator can be evaluated, this method converges at rate $O(1/k)$ in objective value. The rate can be accelerated to $O(1/k^2)$.
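The iteration above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions not in the question: the toy problem (a diagonal quadratic $g$ plus $h = \lambda\|x\|_1$), the names `proximal_gradient` and `soft_threshold`, and the fixed step size $t = 1/L$ are all chosen for the example.

```python
import math

def soft_threshold(v, tau):
    """prox of tau * |.| at the scalar v: shrink v toward zero by tau."""
    return math.copysign(max(abs(v) - tau, 0.0), v)

def proximal_gradient(grad_g, prox_h, x0, t, iters):
    """Iterate x <- prox_h(x - t * grad_g(x), t) with a fixed step size t."""
    x = list(x0)
    for _ in range(iters):
        g = grad_g(x)
        x = prox_h([xi - t * gi for xi, gi in zip(x, g)], t)
    return x

# Toy problem (an assumption for illustration):
# g(x) = 0.5 * sum_i a_i * (x_i - b_i)^2 and h(x) = lam * ||x||_1.
a = [1.0, 0.1, 0.01]   # curvatures; grad g is Lipschitz with L = max(a)
b = [1.0, -2.0, 3.0]
lam = 0.05

def grad_g(x):
    return [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]

def prox_h(y, t):
    # argmin_x lam * ||x||_1 + ||y - x||^2 / (2 t), solved coordinate-wise
    return [soft_threshold(yi, lam * t) for yi in y]

x_hat = proximal_gradient(grad_g, prox_h, [0.0, 0.0, 0.0], t=1.0 / max(a), iters=500)

# The problem is separable, so each coordinate has a closed-form minimizer.
x_star = [soft_threshold(bi, lam / ai) for ai, bi in zip(a, b)]
print(x_hat)
print(x_star)
```

Because the toy problem is separable, `x_star` gives an exact reference to compare the iterates against; for a general $g$ and $h$ no such closed form is available.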

First proposed by Nesterov in 1983 for smooth functions, the idea of acceleration remains an active topic of research (for non-smooth functions, composite functions, etc.). Nesterov's works are not easy to read (they are very mathematical), but to understand the concept it is sufficient to look at ISTA (Iterative Shrinkage-Thresholding Algorithm) and FISTA (Fast Iterative Shrinkage-Thresholding Algorithm). In particular, my questions below are based on the FISTA example:

Roughly speaking, acceleration is achieved by introducing one more sequence of points $y_k$, constructed as a specific linear combination of $x_k$ and $x_{k-1}$; the proximal gradient step is then applied at $y_k$ instead of $x_k$. In the case of FISTA (where $t_k$ now denotes a momentum parameter with $t_1 = 1$, not the step size used above) we have:

$t_{k+1} = \frac{1 + \sqrt{1 + 4t^2_k}}{2}$

$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$

Note that the sequence $t_k$ satisfies $t^2_k = t^2_{k+1} - t_{k+1}$; this relation is what the convergence proof for the algorithm requires.
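The two FISTA updates above can be sketched as follows. This is a hedged illustration, not a reference implementation: the function names, the toy problem (diagonal quadratic $g$ plus $h = \lambda\|x\|_1$), and the fixed step size `step = 1/L` are assumptions made for the example. In the code `t` is the momentum parameter $t_k$, while the gradient step size is called `step` to avoid the notational clash.

```python
import math

def soft_threshold(v, tau):
    """prox of tau * |.| at the scalar v."""
    return math.copysign(max(abs(v) - tau, 0.0), v)

def fista(grad_g, prox_h, x0, step, iters):
    """FISTA sketch: proximal gradient step taken at y, then extrapolation."""
    x_prev = list(x0)
    y = list(x0)
    t = 1.0  # momentum parameter t_1 (not the step size)
    for _ in range(iters):
        g = grad_g(y)
        x = prox_h([yi - step * gi for yi, gi in zip(y, g)], step)
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_next  # weight on (x_k - x_{k-1})
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev, t = x, t_next
    return x_prev

# Toy problem (an assumption for illustration).
a = [1.0, 0.1, 0.01]
b = [1.0, -2.0, 3.0]
lam = 0.05

def grad_g(x):
    return [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]

def prox_h(v, step):
    return [soft_threshold(vi, lam * step) for vi in v]

def objective(x):
    smooth = sum(0.5 * ai * (xi - bi) ** 2 for ai, xi, bi in zip(a, x, b))
    return smooth + lam * sum(abs(xi) for xi in x)

x_hat = fista(grad_g, prox_h, [0.0, 0.0, 0.0], step=1.0 / max(a), iters=500)
x_star = [soft_threshold(bi, lam / ai) for ai, bi in zip(a, b)]  # closed form
gap = objective(x_hat) - objective(x_star)
print(gap)
```

Note that the only change relative to the unaccelerated iteration is that the gradient and the prox are evaluated at the extrapolated point `y` rather than at the last iterate.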

Is there an intuitive way to explain or interpret this approach? Why does such a specific combination work and bring such a noticeable improvement in the convergence rate? Can we find an intuitive way to interpret $t_k$? Perhaps somebody is familiar with Nesterov's works and knows more than I do about why $t_k$ is given in this form in the first place?

I should know this, because I use Nesterov's method frequently, but I can't think of a good intuitive explanation off hand. If I wasn't so busy I would look over my optimization notes and figure this out...if no one answers in the next few days I will. — icurays1, Aug 21 '14 at 04:58
Heck, I've co-authored a peer-reviewed journal article on accelerated first-order methods and I don't have a good *intuition* about why the "acceleration" works. — Michael Grant, Aug 21 '14 at 17:51
I've also spent a lot of time working with these methods and I don't have good intuition for it either. I'm not sure if anyone does. But if someone has a nice answer it will be very interesting. — littleO, Aug 24 '14 at 22:49
Here's the best I can do in the span of a comment. Assume the following: 1) all you know about the function is that it is continuously differentiable with Lipschitz constant $L$; and 2) at each step, you are allowed to use a linear combination of the current gradient *and all previous gradients* to construct your step. Now set it up as a game: you decide what your linear combination will be, and the "devil" will construct the worst possible function for that linear combination. The best you can do, it turns out, are various "optimal" methods that achieve $O(1/k^2)$ complexity. — Michael Grant, Aug 25 '14 at 21:12
@MichaelGrant That's a good explanation. Given that idea, I wonder how hard it is to work out the details and discover optimal first order methods. — littleO, Aug 26 '14 at 22:56
An interesting interpretation of accelerated methods is given in the paper "[Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent](https://arxiv.org/abs/1407.1537)" by Zeyuan Allen-Zhu and Lorenzo Orecchia. — littleO, Dec 28 '17 at 21:29
Here is a paper posted in May 2020 that gives an interesting derivation of Nesterov acceleration: "[From Proximal Point Method to Nesterov's Acceleration](https://arxiv.org/abs/2005.08304)" by Kwangjun Ahn. "we provide a complete understanding of Nesterov's accelerated gradient method (AGM) by establishing quantitative and analytical connections between PPM and AGM. The main observation in this paper is that AGM is in fact equal to a simple approximation of PPM, which results in an elementary derivation of the mysterious updates of AGM as well as its step sizes." — littleO, Jul 05 '20 at 02:59

score 5 · Answer 1 · answered Jul 30 '15 at 08:09

I asked this question myself while trying to understand accelerated methods, asked around, and was pointed to this paper by a professor to help gain some intuition: http://statweb.stanford.edu/~candes/papers/NIPS2014.pdf

To summarize: Su et al. take Nesterov's accelerated gradient method and let the step size become infinitesimally small to derive the following ODE $$\ddot{X} + \frac{3}{t} \dot{X} + \nabla f(X) = 0$$ with initial conditions $X(0) = x_0$ and $\dot{X}(0) = 0$. By analyzing this ODE, we can get a better idea of what Nesterov's accelerated gradient method is doing. Hope this resource is helpful!
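A rough sketch of where the $\frac{3}{t}$ damping term comes from, as I understand the limiting argument. The step size $s$, the smooth-case scheme with weight $\frac{k-1}{k+2}$, and the ansatz $x_k \approx X(k\sqrt{s})$ are notation assumed here for illustration, so treat this as a heuristic outline rather than the paper's proof. Start from

$$x_k = y_{k-1} - s\nabla f(y_{k-1}), \qquad y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1}).$$

Eliminating $y$ and dividing by $\sqrt{s}$ gives

$$\frac{x_{k+1} - x_k}{\sqrt{s}} = \frac{k-1}{k+2}\cdot\frac{x_k - x_{k-1}}{\sqrt{s}} - \sqrt{s}\,\nabla f(y_k).$$

Set $t = k\sqrt{s}$ and Taylor expand:

$$\frac{x_{k+1} - x_k}{\sqrt{s}} = \dot{X}(t) + \tfrac{1}{2}\ddot{X}(t)\sqrt{s} + o(\sqrt{s}), \qquad \frac{x_k - x_{k-1}}{\sqrt{s}} = \dot{X}(t) - \tfrac{1}{2}\ddot{X}(t)\sqrt{s} + o(\sqrt{s}).$$

Since $\frac{k-1}{k+2} = 1 - \frac{3}{k+2} = 1 - \frac{3\sqrt{s}}{t} + o(\sqrt{s})$ and $\sqrt{s}\,\nabla f(y_k) = \sqrt{s}\,\nabla f(X(t)) + o(\sqrt{s})$, we get

$$\dot{X} + \tfrac{1}{2}\ddot{X}\sqrt{s} = \Big(1 - \frac{3\sqrt{s}}{t}\Big)\Big(\dot{X} - \tfrac{1}{2}\ddot{X}\sqrt{s}\Big) - \sqrt{s}\,\nabla f(X) + o(\sqrt{s}).$$

Matching the coefficients of $\sqrt{s}$ yields $\ddot{X} + \frac{3}{t}\dot{X} + \nabla f(X) = 0$. So the constant $3$ in the ODE is exactly the $3$ in $1 - \frac{3}{k+2}$.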

score 2 · Answer 2 · answered Jan 10 '17 at 13:43

I think the best intuition so far is the geometric interpretation of the algorithm by Sébastien Bubeck. The idea is that, from the information available at the current point $x_k$, we know that the optimum $x^*$ lies in the intersection of two balls that can be identified from $x_k$. For the details, see Convex Optimization: Algorithms and Complexity. He also gives a simpler description of the algorithm on his blog, I'm a bandit.

score 1 · Accepted Answer · answered Aug 29 '14 at 15:07

A short interpretation is given by Prof. L. Vandenberghe (UCLA) here: http://www.seas.ucla.edu/~vandenbe/236C/lectures/fgrad.pdf

See slide 5. It is not very informative, but given the lack of answers, I am just going to think of the method as extrapolation.

trembik


But thinking of it as extrapolation doesn't explain the very special term $\frac{k-1}{k+2}$ that appears in the FISTA iteration. – littleO Oct 10 '14 at 10:30
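As a rough numerical check relating the two forms of the extrapolation weight, the snippet below compares FISTA's $\frac{t_k - 1}{t_{k+1}}$ with $\frac{k-1}{k+2}$. It is only a sketch: the helper name `fista_t_sequence` and the initialization $t_1 = 1$ are assumptions made for the example.

```python
import math

def fista_t_sequence(n):
    """Return [t_1, ..., t_n] with t_1 = 1 and t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2."""
    ts = [1.0]
    while len(ts) < n:
        ts.append((1.0 + math.sqrt(1.0 + 4.0 * ts[-1] ** 2)) / 2.0)
    return ts

ts = fista_t_sequence(1001)  # ts[k - 1] holds t_k
for k in (1, 2, 10, 100, 1000):
    fista_weight = (ts[k - 1] - 1.0) / ts[k]  # (t_k - 1) / t_{k+1}
    simple_weight = (k - 1) / (k + 2)
    print(k, round(fista_weight, 4), round(simple_weight, 4))
```

Since $t_{k+1} \ge t_k + \frac{1}{2}$, the sequence grows roughly like $\frac{k+1}{2}$, which gives $\frac{t_k - 1}{t_{k+1}} \approx \frac{(k+1)/2 - 1}{(k+2)/2} = \frac{k-1}{k+2}$; the two weights agree at $k = 1$ and approach each other again as $k$ grows. This shows the two forms are close, but it does not by itself explain why this particular weight is the right one.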

score 0 · Answer 4 · answered Aug 05 '19 at 17:14

Amir Beck himself said in one of his lectures that there is no intuition to be gained: the algebra simply checks out. So while intuitions are nice, sometimes it might be reasonable not to try too hard to get them.

Either way, in Nesterov's lecture notes (page 68) he more or less constructs the extrapolation parameter via estimate sequences of functions. This might be easier to read than his original paper (though still not easy) and may give some intuition.
