1

What is the interpretation of a stochastic optimization problem where a gradient descent algorithm converges under a wide range of learning rate schedules (including ones with quite large initial values, i.e. when taking large gradient steps)? Does it mean that the objective function landscape contains a large attractor into which the iterative process always falls and gets trapped? If so, how can one escape from it?

Dionysis M
  • 794
  • 6
  • 17

1 Answer

2

For a "nice" convex optimization problem, with a single global minimum, everywhere differentiable and no plateaus, SGD will converge given two conditions on the learning rate $\gamma$: $$ \sum_{t=1}^\infty \gamma_t = \infty , ~~ \sum_{t=1}^\infty \gamma_t^2 < \infty $$

(See Bottou L., Online Learning and Stochastic Approximations for the full proof.)

This means that the scale of the learning rate is irrelevant for proving eventual convergence: intuitively, imagine an algorithm that starts with $\gamma_0 = 100c$. Since the learning rate is decreasing, after some number of steps $t$ the algorithm, if it has not yet converged, will reach $\gamma_t = c$ with some estimate $w_t = w$. From that point on it is identical to an algorithm that started with $\gamma'_0 = c$ and an initial guess $w'_0 = w$. (Subject to some assumptions about how $\gamma$ decreases.)
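Here is a minimal sketch of that point (a hypothetical toy problem of my own, not from the reference): SGD on $\mathbb{E}[(w - x)^2/2]$ with $x \sim \mathcal{N}(3, 1)$, so the unique minimum is $w^\ast = 3$, using the schedule $\gamma_t = \gamma_0 / t$ with wildly different scales $\gamma_0$:

```
import numpy as np

# Toy problem: minimize E[(w - x)^2 / 2] with x ~ N(3, 1); the minimum is w* = 3.
# The stochastic gradient at w is (w - x) for a fresh sample x.
def run_sgd(gamma0, steps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    w = 0.0
    for t in range(1, steps + 1):
        x = rng.normal(3.0, 1.0)
        w -= (gamma0 / t) * (w - x)   # gamma_t = gamma0 / t satisfies both conditions
    return w

# Wildly different initial scales: the large ones overshoot badly at first,
# but once gamma_t has decayed they behave like freshly started small-rate runs.
for gamma0 in (0.5, 5.0, 50.0):
    print(gamma0, run_sgd(gamma0))
```

The large-$\gamma_0$ runs overshoot wildly in the first iterations, yet all of them end up close to $w^\ast = 3$, which is exactly the point about scale above.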

Therefore, for the algorithm not to converge to the minimum under such learning rates, some condition on the gradient itself must be broken, e.g. the minimum is not unique, or the gradient vanishes on some plateau. It is hard to say anything more concrete than that.

One example could be a large convex region that contains a local minimum and the initial value: with small learning rates we never leave this region and converge, while large rates are more likely to produce a jump out of it, into some plateau or another valley from which the algorithm will not return. Trying different starting points should help identify such regions. The actual probability of staying in the local valley depends in complicated ways on the shape of the transition area, the "depth" of the valley and other parameters; I am not aware of a formal way to treat this, although there may well be publications given the general interest in SGD nowadays.
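As a toy illustration of this effect (a made-up one-dimensional landscape, not a formal treatment): take $f(w) = 0.05\,w^2 - \cos w$, which has its global minimum at $w = 0$, local minima near $w = \pm 5.68$ and barriers (local maxima) near $w = \pm 3.49$, and start noisy gradient descent inside the right-hand local valley:

```
import numpy as np

# Toy non-convex landscape: f(w) = 0.05*w^2 - cos(w).
# Global minimum at w = 0, local minima near w = +/-5.68, barriers near w = +/-3.49.
def grad(w):
    return 0.1 * w + np.sin(w)

def run_sgd(gamma0, seed, steps=2000, noise_sd=0.3):
    rng = np.random.default_rng(seed)
    w = 5.7                                    # start inside the right-hand local valley
    for t in range(1, steps + 1):
        g = grad(w) + noise_sd * rng.normal()  # noisy gradient estimate
        w -= (gamma0 / t) * g                  # decaying schedule gamma_t = gamma0 / t
    return w

for gamma0 in (0.1, 1.0, 10.0):
    left = sum(run_sgd(gamma0, seed) < 3.49 for seed in range(200))
    print(f"gamma0 = {gamma0}: left the starting valley in {left}/200 runs")
```

With this particular landscape, noise level and schedule, the small-$\gamma_0$ runs essentially never cross the barrier, while the largest initial rate jumps out of the starting valley in a sizeable fraction of runs; the exact fractions depend on the seed, the noise and the schedule, so treat the numbers as illustrative only.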

juod
  • 2,192
  • 11
  • 20
  • 1
    This is a useful summary of results for *convex* optimization. However, since the question bears the tag **[tag:non-convex]**, perhaps you could explain how this pertains to a non-convex optimization task? – Sycorax Jun 29 '20 at 05:11