
Is the following statement true: Gradient descent is guaranteed to always decrease a loss function.

I know that if the loss function is convex, then each iteration of gradient descent will result in a smaller loss, but does this statement hold for all loss functions? Could we design a loss function in such a way that performing gradient descent was not optimal?

Shrey
    Imagine the loss function $f(x) = 0$ – shimao Oct 04 '18 at 21:59
  • How is the step size ascertained? – Matthew Gunn Oct 04 '18 at 22:17
  • @MatthewGunn The question does not speak to the value of the step size, so I'm assuming it could be anything. – Shrey Oct 04 '18 at 22:31
  • I think you're asking two separate things. The first is essentially if GD works for *all* loss functions, i.e. *is there a loss function where GD wouldn't work?* In the second question you ask if there *is a loss function where GD isn't optimal?* The second is quite easy: on any non-convex loss function, GD is dependent on its starting point (see the sketch after these comments). Imagine if that were true and GD were the optimal way to decrease *any* loss function imaginable. The first question is more interesting, but in my opinion you should give a bit more details (e.g. can the loss function be non-differentiable?) – Djib2011 Oct 04 '18 at 23:14
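
For the starting-point dependence mentioned in the comments above, here is a minimal sketch (a hypothetical example, not from the thread) using the double-well loss $\mathcal{L}(\theta) = \theta^4 - 2\theta^2$, which has local minima at $\theta = \pm 1$:

```python
# Hypothetical example: gradient descent on the non-convex loss
# L(theta) = theta**4 - 2*theta**2, which has minima at theta = -1 and theta = +1.
def grad(theta):
    # Derivative of theta**4 - 2*theta**2
    return 4 * theta ** 3 - 4 * theta

def run_gd(theta, eta=0.05, steps=200):
    for _ in range(steps):
        theta -= eta * grad(theta)
    return theta

print(run_gd(0.5))   # converges to roughly +1
print(run_gd(-0.5))  # converges to roughly -1: the result depends on the starting point
```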

1 Answer


Whether the loss decreases depends on your step size. The negative gradient is the direction of steepest descent, i.e., the direction in which the loss decreases most quickly locally. A step of gradient descent is given by:

$$\theta_t = \theta_{t-1} -\eta\nabla_\theta\mathcal{L}$$

where $t$ indexes the training iteration, $\theta$ is the parameter vector, $\eta$ is the step size, and $\mathcal{L}$ is the loss. When you perform gradient descent, you are essentially linearizing the loss function around $\theta_{t-1}$: the step assumes the loss will drop by roughly $\eta\|\nabla_\theta\mathcal{L}\|^2$. If $\eta$ is too large, that linear approximation breaks down, and $\mathcal{L}(\theta_{t-1} - \eta\nabla_\theta\mathcal{L})$ may be greater than $\mathcal{L}(\theta_{t-1})$, meaning you overshot the local minimum and increased the loss (see illustration below). Note that it is also possible to overshoot the minimum and still have a decrease in loss (not shown).

[Illustration: a gradient step with too large a step size overshoots the local minimum and lands at a higher loss value]
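
To make the step-size point concrete, here is a minimal sketch (a hypothetical example, not from the answer) on the convex loss $\mathcal{L}(\theta) = \theta^2$: with a small $\eta$ the loss shrinks every iteration, while with $\eta > 1$ every step overshoots the minimum at $\theta = 0$ and the loss grows:

```python
# Hypothetical example: gradient descent on L(theta) = theta**2 (gradient 2*theta).
def loss(theta):
    return theta ** 2

def grad(theta):
    return 2 * theta

for eta in (0.1, 1.5):          # small vs. too-large step size
    theta = 1.0
    for _ in range(3):
        theta -= eta * grad(theta)
        print(f"eta={eta}: theta={theta:.3f}, loss={loss(theta):.3f}")
# With eta=0.1 the loss decreases each step; with eta=1.5 each step
# overshoots theta=0 and the loss increases (4, 16, 64, ...).
```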

Vivek Subramanian