30

Given a convex cost function, when using SGD for optimization, we have a gradient (vector) at a certain point during the optimization process.

My question is: given a point on the convex surface, does the gradient only point in the direction in which the function increases/decreases fastest, or does it always point at the optimal/extreme point of the cost function?

The former is a local concept, the latter is a global concept.

SGD can eventually converge to the extreme value of the cost function. I'm wondering about the difference between the direction of the gradient at an arbitrary point on the convex surface and the direction pointing at the global extreme value.

The gradient's direction should be the direction in which the function increases/decreases fastest at that point, right?

Sextus Empiricus
CyberPlayerOne
  • 8
    Have you ever walked straight downhill from a mountain ridge, only to find yourself in a valley that continues downhill *in a different direction?* The challenge is to imagine such a situation with convex topography: think of a knife edge where the ridge is steepest at the top. – whuber Sep 18 '18 at 14:48
  • 4
    No, because it's stochastic gradient descent, not gradient descent. The whole point of SGD is that you throw away some of the gradient information in return for increased computational efficiency -- but obviously in throwing away some of the gradient information you're no longer going to have the direction of the original gradient. This is already ignoring the issue of whether or not the regular gradient points in the direction of optimal descent, but the point being, even if regular gradient descent did, there's no reason to expect _stochastic_ gradient descent to do so. – Chill2Macht Sep 18 '18 at 16:50
  • 1
    @sycorax, how does this question classify under neural networks? – Sextus Empiricus Sep 18 '18 at 20:18
  • 3
    @Tyler, why is your question specifically about *stochastic* gradient descent. Do you imagine somehow something different in comparison to standard gradient descent? – Sextus Empiricus Sep 18 '18 at 20:22
  • 2
    The gradient will always point toward the optimum in the sense that the angle between the gradient and the vector to the optimum will have angle less than $\frac{\pi}{2}$, and walking in the direction of the gradient an infinitesimal amount will get you closer to the optimum. – Reinstate Monica Sep 18 '18 at 20:23
  • 6
    If the gradient pointed directly at a global minimizer, convex optimization would become super easy, because we could then just do a one-dimensional line search to find a global minimizer. This is too much to hope for. – littleO Sep 18 '18 at 20:25
  • @MartijnWeterings The most common usage of SGD is neural network training. Even though the question is framed in terms of convex problems (which usually are not neural networks), I think adding the tag will make a generally useful observation about (S)GD optimization more visible to interested readers working on NNs. – Sycorax Sep 18 '18 at 20:30
  • @MartijnWeterings I was not thinking about SGD in particular when asking this question, so actually I was thinking about GD. But since SGD uses training samples one by one, I think the direction of the gradient in SGD should be more random than that of the gradient in GD, right? That's another thing though. – CyberPlayerOne Sep 19 '18 at 05:26
  • @Tyler In SGD, the descent surface is different for every mini-batch. But taking many steps on these different surfaces also leads to a zero-derivative point in the original surface. – Jan Kukacka Sep 19 '18 at 05:44
  • @Sycorax a problem with tagging under neural networks is that people might have this tag under their ignored tags. – Sextus Empiricus Sep 22 '18 at 07:52

6 Answers

42

They say an image is worth more than a thousand words. In the following example (courtesy of MS Paint, a handy tool for amateur and professional statisticians both) you can see a convex function surface and a point where the direction of the steepest descent clearly differs from the direction towards the optimum.

An image of an elongated convex function and arrows showing that the direction of the steepest descent is not the same as the direction towards the global optimum

On a serious note: There are far superior answers in this thread that also deserve an upvote.

Jan Kukacka
  • 34
    And today's counter-example is ... an avocado! – JDL Sep 18 '18 at 13:28
  • 13
    You see that while cutting an avocado, you should cut in the steepest descent direction to avoid the seed and a [possible injury](http://imj.ie/the-avocado-hand/). – Jan Kukacka Sep 18 '18 at 14:16
33
  • Gradient descent methods use the slope of the surface.
  • This slope will not necessarily (indeed, most likely will not) point directly towards the extreme point.

An intuitive way to see this is to imagine the path of descent as a curved path. See for instance the examples below.

As an analogy: imagine I blindfold you and put you somewhere on a mountain, with the task of walking back to the extreme (lowest) point. If you only have local information on the hill, then you do not know in which direction the lowest point lies.

If you can assume convexity

  • Then you know that there is only one extreme point.
  • Then you know that you will certainly reach the extreme point as long as you keep moving downwards.
  • And then you also know that the angle between the steepest descent direction and the optimum direction is always at most $\pi/2$, as Solomonoff's Secret mentioned in the comments.

convex
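A quick numerical check of the last bullet, as a minimal sketch: it assumes a simple convex quadratic with its minimum at the origin (my own illustrative choice, not the function used in the figures).

```python
import numpy as np

# Convex quadratic with minimum at the origin: f(x) = x1^2 + 10*x2^2
def grad(x):
    return np.array([2 * x[0], 20 * x[1]])

rng = np.random.default_rng(0)
for _ in range(10_000):
    x = rng.uniform(-5, 5, size=2)
    descent = -grad(x)   # steepest-descent direction
    to_opt = -x          # direction from x towards the optimum (the origin)
    cos_angle = descent @ to_opt / (np.linalg.norm(descent) * np.linalg.norm(to_opt))
    assert cos_angle > 0  # i.e. the angle is strictly less than pi/2
```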

Without convexity

  • The angle may exceed $\pi/2$. In the image below this is emphasized by an arrow showing the direction of descent at a particular point, where the final solution lies behind the line perpendicular to the direction of descent.

    In the convex problem this is not possible. You could relate this to the isolines for the cost function having a curvature all in the same direction when the problem is convex.

non convex

In Stochastic Gradient Descent

  • You follow the steepest direction for a single sample point (and you repeatedly take a step for a different point). In the example the problem is convex, but there may be more than one solution. In the example the extreme values lie on a line (instead of at a single point), and from this particular viewpoint you could say that the steepest-descent direction may point directly at the "optimum" (although it is only the optimum for the cost function of that particular training sample point).

single point

Below is another view, for four data points. Each of the four images shows the surface for a different single point. At each step a different point is chosen, and the gradient is computed with respect to it. As a result there are only four directions along which a step can be made, but the step sizes decrease as we get closer to the solution.

stochastic gradient descent



The above images are for four data points generated by the function:

$$y_i = e^{-0.4x_i}-e^{-0.8 x_i} + \epsilon_i$$

x = 0      2      4      6           
y = 0.006  0.249  0.153  0.098

which results in:

  • a non-convex optimization problem when we minimize the (non-linear) cost function $$ S(a,b) = \sum_{i=1}^4 \left( y_i - (e^{-ax_i}-e^{-b x_i}) \right)^2$$ $$\nabla S(a,b) = \begin{bmatrix} \sum_{i=1}^4 2 x_i e^{-a x_i}\left( y_i - (e^{-ax_i}-e^{-b x_i}) \right) \\ \sum_{i=1}^4 -2 x_i e^{-b x_i}\left( y_i - (e^{-ax_i}-e^{-b x_i}) \right) \end{bmatrix}$$

  • a convex optimization problem (like any linear least squares problem) when we minimize $$ S(a,b) = \sum_{i=1}^4 \left( y_i - (a e^{-0.4 x_i}- b e^{-0.8 x_i}) \right)^2$$ $$\nabla S(a,b) = \begin{bmatrix} \sum_{i=1}^4 -2 e^{-0.4x_i}\left( y_i - (a e^{-0.4x_i}- b e^{-0.8 x_i}) \right) \\ \sum_{i=1}^4 2 e^{-0.8x_i}\left( y_i - (a e^{-0.4x_i}- b e^{-0.8 x_i}) \right) \end{bmatrix}$$

  • a convex optimization problem (but not with a single minimum) when we minimize for some specific $i$ $$ S(a,b) = \left( y_i - (a e^{-0.4 x_i}- b e^{-0.8 x_i}) \right)^2$$ which has gradient $$\nabla S(a,b) = \begin{bmatrix} -2 e^{-0.4x_i}\left( y_i - (a e^{-0.4x_i}- b e^{-0.8 x_i}) \right) \\ 2 e^{-0.8x_i}\left( y_i - (a e^{-0.4x_i}- b e^{-0.8 x_i}) \right) \end{bmatrix}$$ This has multiple minima: there are multiple pairs $(a, b)$ for which $S = 0$.
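For concreteness, here is a minimal sketch of SGD on this convex least-squares problem, taking one step per data point in turn (my own illustration; not the exact code used to produce the figures):

```python
import numpy as np

x = np.array([0.0, 2.0, 4.0, 6.0])
y = np.array([0.006, 0.249, 0.153, 0.098])

a, b = 0.0, 0.0   # parameters of the model a*exp(-0.4 x) - b*exp(-0.8 x)
lr = 0.1

for step in range(2000):
    i = step % 4  # one data point per step
    r = y[i] - (a * np.exp(-0.4 * x[i]) - b * np.exp(-0.8 * x[i]))
    # gradient of the single-point cost (y_i - (a e^{-0.4 x_i} - b e^{-0.8 x_i}))^2
    grad_a = -2 * np.exp(-0.4 * x[i]) * r
    grad_b = 2 * np.exp(-0.8 * x[i]) * r
    a, b = a - lr * grad_a, b - lr * grad_b

print(a, b)  # should end up roughly near (1, 1), the values used to generate the data
```

Because the data are noisy, the four single-point optima do not coincide exactly, so the iterates end up hovering in a small region near the least-squares solution rather than at a single point.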


Sextus Empiricus
19

Steepest descent can be inefficient even if the objective function is strongly convex.

Ordinary gradient descent

I mean "inefficient" in the sense that steepest descent can take steps that oscillate wildly away from the optimum, even if the function is strongly convex or even quadratic.

Consider $f(x)=x_1^2 + 25x_2^2$. This is convex because it is a quadratic with positive coefficients. By inspection, we can see that it has a global minimum at $x=[0,0]^\top$. It has gradient $$ \nabla f(x)= \begin{bmatrix} 2x_1 \\ 50x_2 \end{bmatrix} $$

With a learning rate of $\alpha=0.035$, and initial guess $x^{(0)}=[0.5, 0.5]^\top,$ we have the gradient update

$$ x^{(1)} =x^{(0)}-\alpha \nabla f\left(x^{(0)}\right) $$

which exhibits this wildly oscillating progress towards the minimum.

[Plot: gradient descent iterates zig-zagging towards the minimum]

Indeed, the angle $\theta$ between the vector from $x^{(i)}$ to the optimum $x^*$ and the vector from $x^{(i)}$ to $x^{(i+1)}$ only gradually decays to 0. What this means is that the direction of the update is sometimes wrong -- by as much as almost 68 degrees -- even though the algorithm is converging and working correctly.

[Plot: the angle between the update direction and the direction to the minimum, decaying over iterations]

Each step is wildly oscillating because the function is much steeper in the $x_2$ direction than the $x_1$ direction. Because of this fact, we can infer that the gradient is not always, or even usually, pointing toward the minimum. This is a general property of gradient descent when the eigenvalues of the Hessian $\nabla^2 f(x)$ are on dissimilar scales. Progress is slow in directions corresponding to the eigenvectors with the smallest corresponding eigenvalues, and fastest in the directions with the largest eigenvalues. It is this property, in combination with the choice of learning rate, that determines how quickly gradient descent progresses.

The direct path to the minimum would be to move "diagonally" instead of in this fashion which is strongly dominated by vertical oscillations. However, gradient descent only has information about local steepness, so it "doesn't know" that strategy would be more efficient, and it is subject to the vagaries of the Hessian having eigenvalues on different scales.
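A minimal sketch of this iteration (assuming only the quadratic and learning rate above), which also prints the angle between each update direction and the direction to the minimum:

```python
import numpy as np

def grad(x):
    # gradient of f(x) = x1^2 + 25*x2^2
    return np.array([2 * x[0], 50 * x[1]])

alpha = 0.035
x = np.array([0.5, 0.5])
x_star = np.zeros(2)  # the global minimum

for i in range(25):
    step = -alpha * grad(x)                # gradient descent update direction
    to_opt = x_star - x                    # direction towards the minimum
    cos_theta = step @ to_opt / (np.linalg.norm(step) * np.linalg.norm(to_opt))
    print(f"iter {i:2d}  x = {x}  angle = {np.degrees(np.arccos(cos_theta)):.1f} deg")
    x = x + step
```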

Stochastic gradient descent

SGD has the same properties, with the exception that the updates are noisy, implying that the contour surface looks different from one iteration to the next, and therefore the gradients are also different. This implies that the angle between the direction of the gradient step and the optimum will also have noise - just imagine the same plots with some jitter.
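As a rough illustration of this, one can add zero-mean noise to the gradient in the previous sketch as a stand-in for minibatch sampling (an assumption for illustration, not a simulation of any particular dataset) and watch the angle to the optimum jitter:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.035
x = np.array([0.5, 0.5])

for i in range(25):
    g = np.array([2 * x[0], 50 * x[1]])          # exact gradient of x1^2 + 25*x2^2
    g_noisy = g + rng.normal(scale=2.0, size=2)  # noise as a stand-in for minibatch sampling
    step = -alpha * g_noisy
    to_opt = -x                                  # minimum is at the origin
    cos_theta = step @ to_opt / (np.linalg.norm(step) * np.linalg.norm(to_opt))
    print(f"iter {i:2d}  angle to optimum = {np.degrees(np.arccos(cos_theta)):.1f} deg")
    x = x + step
```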

This answer borrows this example and figure from Neural Networks Design (2nd Ed.) Chapter 9 by Martin T. Hagan, Howard B. Demuth, Mark Hudson Beale, Orlando De Jesús.

Sycorax
13

The local steepest-descent direction is not the same as the direction towards the global optimum. If it were, your gradient direction would never change: if you were always heading towards your optimum, your direction vector would always point at the optimum. But that's not the case. If it were, why bother recalculating the gradient at every iteration?

gunes
3

The other answers highlight some annoying rate-of-convergence issues for GD/SGD, but your comment "SGD can eventually converge..." isn't always correct (ignoring pedantic usage remarks about the word "can" since it seems you meant "will").

One nice trick for finding counter-examples with SGD is to notice that if every data point is the same, your cost function is deterministic. Imagine the extremely pathological example where we have one data point $$(x_0,y_0)=(1,0)$$ and we have a model for how our system should work based on a single parameter $\alpha$ $$f(x,\alpha)=\sqrt{\alpha^2-\alpha x}.$$

With MSE as our cost function, this simplifies to $$(f(x_0,\alpha)-y_0)^2=\alpha^2-\alpha,$$ a convex function. Suppose we choose our learning rate $\beta$ poorly so that our update rule is as follows: $$\alpha_{n+1}=\alpha_n-\beta(2\alpha_n-1)=\alpha_n-(2\alpha_n-1)=1-\alpha_n.$$ Now, our cost function has a minimum at $\alpha=\frac12$, but if we start literally anywhere other than $p=\frac12$ then SGD will simply cycle between the starting point $p$ and $1-p$ and never converge.
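A minimal sketch of this bouncing behavior, using just the update rule above with $\beta = 1$:

```python
alpha = 0.2  # any starting point other than 0.5
for _ in range(6):
    alpha = alpha - 1.0 * (2 * alpha - 1)  # gradient step on alpha^2 - alpha with beta = 1
    print(alpha)  # alternates 0.8, 0.2, 0.8, ... and never converges to 0.5
```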

I'm not sure if convexity is enough to break some worse behavior that exists for general SGD, but if you allow functions even as complex as cubics for your cost function then SGD can bounce around on a dense subset of the domain and never converge anywhere or approach any cycle.

SGD can also approach/obtain cycles of any finite length, diverge toward $\infty$, oscillate toward $\pm\infty$ (excuse the notation), and have tons of other pathological behavior.

One interesting thing about the whole situation is that there exist uncountably many procedures (like SGD) which take arbitrary convex functions as inputs and output an update rule that always quickly converges to the global minimum (if one exists). Even though conceptually there exist loads of them, our best attempts at convex optimization all have pathological counterexamples. Somehow the idea of a simple/intuitive/performant update rule runs counter to the idea of a provably correct update rule.

  • 1
    +1 for this observation. But this $\beta=1$ is a bit of a bad choice and would also be bad in the case of regular gradient descent. It is a good comment but it doesn't really relate to the issue of whether the steepest descent path points towards the solution or not; it relates instead to the issue of too-large step sizes that may lead to divergent updating. – Sextus Empiricus Sep 19 '18 at 12:25
  • 1
    Note that SGD convergence proof assumes a decreasing step size... – Jan Kukacka Sep 19 '18 at 14:04
  • @MartijnWeterings Good observation. I guess my example does actually point the correct direction. Should I update it with a 2D example that never points the right direction and diverges? – Hans Musgrave Sep 19 '18 at 15:35
  • @MartijnWeterings Agreed, $\beta=1$ is a bad choice. For any $\beta>0$ though, there exists a pathological cost function for which that $\beta$ fails. One of the easiest ones stems from $f(x,\alpha)=\sqrt{\frac{\alpha^2-\alpha x}{\beta}}.$ – Hans Musgrave Sep 19 '18 at 15:37
  • @JanKukacka That's a common modification to SGD that suffers from a similar flaw. Instead of the cost function being a parabola, you choose $f$ so that the cost function is a symmetric convex function rising rapidly enough in both directions from the minimum to counteract the cooling rate of $\beta$. The SGD convergence proofs I've seen are only with probability 1 and rely on such badly chosen cost functions existing with probability 0 with typical measures on the space of cost functions. – Hans Musgrave Sep 19 '18 at 15:46
  • I guess that it is pathological for any polynomial function $y=\frac{x^2}{\beta}$ which has derivative $\beta \frac{d}{dx} y = 2x$ – Sextus Empiricus Sep 19 '18 at 15:47
  • It would be interesting to see (or hear about) the example in which you have divergence because the direction is wrong (thus no matter how small a step-size you take, the solution will diverge). Does it relate to the fact that SGD never gets you 'exactly' to the solution? In the example from my answer you see four surfaces for four different points, and the steps can be seen as moving randomly towards one of the four lines. But the lines do not meet each other exactly in a single common center. Eventually you will get stuck around some 'region' near the minimum. – Sextus Empiricus Sep 19 '18 at 15:50
  • @MartijnWeterings It has to do with the lack of scaling in gradient descent. Exponential functions throw it off. Try to fit $f(x,a,b)=j(e^a+xe^{-a})+k(x^2e^b+x^3e^{-b})$ with $(x_0,y_0)=(1,0)$ with $j,k>0$, and you get a strongly convex cost function using MSE. The function grows so quickly that you diverge rather than converge. If the learning rate is small, make $j$ and $k$ large to keep divergent behavior. To keep the gradient from pointing toward the global minimum, let $j\neq k$. – Hans Musgrave Sep 20 '18 at 04:38
  • @MartijnWeterings One of the comments mentioned a cooling of the learning rate (say $L(n)\to0$ as $n\to\infty$). One trick is to choose any $g(n)$ with $g(n)L(n)>c$ for some fixed positive $c$, $g$ monotone increasing, $g(x)>x$ for all $x>0$, and $g(0)>1$ (I think a smaller lower bound might do). Then consider $h(x,a,b)=f(x,g(a)+g(-a),g(b)+g(-b))$. The resulting mess is strongly convex and grows quickly enough to combat learning rate cooling at arbitrary rates (note that $L\neq0$ if you want to converge and start somewhere other than the solution). – Hans Musgrave Sep 20 '18 at 04:45
  • @MartijnWeterings I am pretty sure that for any **convex** function, there is a learning rate which causes it to converge. Every one of my counter-examples started with a learning rate and found a counter-function. For non-convex functions, there exist loads of examples where no matter the learning rate SGD will diverge rather than approach the global minimum. I recall that uniform continuity eliminated some of my counter-examples, but I don't think it was sufficient to eliminate them all. It's been a while since I've done any of this. – Hans Musgrave Sep 20 '18 at 04:50
  • @HansMusgrave, these examples are still more related to problems with the learning rate than to problems with the direction. The convergence rate will be limited by higher-order derivatives, and in the case of a poorly chosen learning rate you may get divergence. This is true for any gradient descent method and is not particular to SGD and its way of picking the direction of descent from only a single point/subset. – Sextus Empiricus Sep 20 '18 at 09:04
2

Maybe the answers to this question need a quick update. It seems that SGD yields a global minimum also in the non-convex case (the convex case is just a special case of that):

SGD Converges To Global Minimum In Deep Learning via Star-Convex Path, Anonymous authors, Paper under double-blind review at ICLR 2019

https://openreview.net/pdf?id=BylIciRcYQ

The authors establish the convergence of SGD to a global minimum for nonconvex optimization problems that are commonly encountered in neural network training. The argument exploits the following two important properties: 1) the training loss can achieve zero value (approximately); 2) SGD follows a star-convex path. In such a context, although SGD has long been considered as a randomized algorithm, the paper reveals that it converges in an intrinsically deterministic manner to a global minimum.

This should be taken with a grain of salt though. The paper is still under review.

The notion of a star-convex path gives a hint about where the gradient points at each iteration.

Tolga Birdal