
Is the following statement true: Gradient descent is guaranteed to always decrease a loss function.

I know that if the loss function is convex, then each iteration of gradient descent will result in a smaller loss, but does this statement hold for all loss functions? Could we design a loss function in such a way that performing gradient descent was not optimal?

Shrey
    Imagine the loss function $f(x) = 0$ – shimao Oct 04 '18 at 21:59
  • How is the step size ascertained? – Matthew Gunn Oct 04 '18 at 22:17
  • @MatthewGunn The question does not speak to the value of the step size, so I'm assuming it could be anything. – Shrey Oct 04 '18 at 22:31
  • I think you're asking two separate things. The first is essentially if GD works for *all* loss functions, i.e. *is there a loss function where GD wouldn't work?* In the second question you ask if there *is a loss function where GD isn't optimal?* The second is quite easy: on any non-convex loss function, GD is dependent on its starting point (see the sketch after these comments). Imagine if that were true and GD were the optimal way to decrease *any* loss function imaginable. The first question is more interesting, but in my opinion you should give a bit more details (e.g. can the loss function be non-differentiable?) – Djib2011 Oct 04 '18 at 23:14
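
For the starting-point dependence mentioned in the comments above, here is a minimal sketch (a hypothetical example, not from the thread) using the double-well loss $\mathcal{L}(\theta) = \theta^4 - 2\theta^2$, which has local minima at $\theta = \pm 1$:

```python
# Hypothetical example: gradient descent on the non-convex loss
# L(theta) = theta**4 - 2*theta**2, which has minima at theta = -1 and theta = +1.
def grad(theta):
    # Derivative of theta**4 - 2*theta**2
    return 4 * theta ** 3 - 4 * theta

def run_gd(theta, eta=0.05, steps=200):
    for _ in range(steps):
        theta -= eta * grad(theta)
    return theta

print(run_gd(0.5))   # converges to roughly +1
print(run_gd(-0.5))  # converges to roughly -1: the result depends on the starting point
```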

1 Answer


Whether the loss decreases depends on your step size. The negative gradient is the direction of steepest descent, i.e., the direction in which the loss decreases most quickly locally. A step of gradient descent is given by:

$$\theta_t = \theta_{t-1} -\eta\nabla_\theta\mathcal{L}$$

where $t$ indexes the training iteration, $\theta$ is the parameter vector, $\eta$ is the step size, and $\mathcal{L}$ is the loss. When you perform gradient descent, you are essentially linearizing the loss function around $\theta_{t-1}$: the step assumes the loss will drop by roughly $\eta\|\nabla_\theta\mathcal{L}\|^2$. If $\eta$ is too large, that linear approximation breaks down, and $\mathcal{L}(\theta_{t-1} - \eta\nabla_\theta\mathcal{L})$ may be greater than $\mathcal{L}(\theta_{t-1})$, meaning you overshot the local minimum and increased the loss (see illustration below). Note that it is also possible to overshoot the minimum and still have a decrease in loss (not shown).

[Illustration: a gradient step with too large a step size overshoots the local minimum and lands at a higher loss value]
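
To make the step-size point concrete, here is a minimal sketch (a hypothetical example, not from the answer) on the convex loss $\mathcal{L}(\theta) = \theta^2$: with a small $\eta$ the loss shrinks every iteration, while with $\eta > 1$ every step overshoots the minimum at $\theta = 0$ and the loss grows:

```python
# Hypothetical example: gradient descent on L(theta) = theta**2 (gradient 2*theta).
def loss(theta):
    return theta ** 2

def grad(theta):
    return 2 * theta

for eta in (0.1, 1.5):          # small vs. too-large step size
    theta = 1.0
    for _ in range(3):
        theta -= eta * grad(theta)
        print(f"eta={eta}: theta={theta:.3f}, loss={loss(theta):.3f}")
# With eta=0.1 the loss decreases each step; with eta=1.5 each step
# overshoots theta=0 and the loss increases (4, 16, 64, ...).
```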

Vivek Subramanian