Many tutorials online talk about gradient descent, and almost all of them use a fixed step size (learning rate $\alpha$). Why is there no use of line search (such as backtracking line search or exact line search)?
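To make the comparison concrete, here is a minimal toy sketch of what I mean (my own illustration, not from any particular tutorial); `f` and `grad_f` stand for the full objective and its gradient, and all constants are arbitrary:

```python
import numpy as np

def gd_fixed_step(f, grad_f, x0, alpha=0.1, n_iter=100):
    """Vanilla gradient descent with a fixed learning rate alpha."""
    x = x0.copy()
    for _ in range(n_iter):
        x = x - alpha * grad_f(x)
    return x

def gd_backtracking(f, grad_f, x0, alpha0=1.0, beta=0.5, c=1e-4, n_iter=100):
    """Gradient descent where each step size is chosen by backtracking
    (Armijo) line search: shrink alpha until sufficient decrease holds."""
    x = x0.copy()
    for _ in range(n_iter):
        g = grad_f(x)
        alpha = alpha0
        # Armijo condition: f(x - alpha*g) <= f(x) - c * alpha * ||g||^2
        while f(x - alpha * g) > f(x) - c * alpha * g.dot(g):
            alpha *= beta
        x = x - alpha * g
    return x

# Toy example: ill-conditioned quadratic bowl f(x) = 0.5 * x^T A x
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
x0 = np.array([5.0, 5.0])
print(gd_fixed_step(f, grad_f, x0))     # depends on a hand-tuned alpha
print(gd_backtracking(f, grad_f, x0))   # step size chosen automatically
```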

- "And almost all of them use a fixed step size" - are you sure? ["learning rate"](http://cs231n.github.io/neural-networks-3/#ratio) hyperparameters are supposed to adapt the step size to the conditions. A very popular [Adam algorithm](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/) does adapt the step size – Aksakal Jan 04 '18 at 18:44
- Hmm, actually adaptive step size gradient methods have been around since at least 2011, and they are even cited on the Wikipedia [Stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) page. It's not exactly hot news. Even vanilla SGD is nearly always used with a learning rate which changes with the number of iterations (*schedule*). Now, a *very* good question would be: why, even if there are so many adaptive gradient descent methods, SGD still dominates the Deep Learning world? The question is much less trivial than it might seem. – DeltaIV Jan 04 '18 at 18:47
- @Aksakal yes, I think it is not fixed. But I have never seen line search used... – Haitao Du Jan 04 '18 at 18:52
- Backtracking line-search fixes a direction and then looks for a way to reduce the function. So unless you have an intelligent way of picking the direction to search in, you're in for a tedious optimization. – Alex R. Jan 04 '18 at 18:54
- I don't see that line search makes sense for SGD (as opposed to [batch] gradient descent) - so I would say that's the reason. – seanv507 Jan 04 '18 at 19:30
- I suspect the reason why line search is not very popular is the batching in gradient descent. You get a batch, then compute the gradient. It doesn't make a lot of sense to go back and forth along the line because of the noise in the gradient. It's better to keep going with the next batch while maybe annealing the step size. – Aksakal Jan 05 '18 at 04:19
- Line search is not recommended for machine learning applications because it is computationally very expensive. – Derrell D'Souza Aug 07 '18 at 17:38
- @DerrellD'Souza **exact** line search is expensive, but wouldn't **backtracking** line search just require a few forward passes (which can be done on the mini-batch at hand)? – Josh Jun 01 '20 at 17:27
3 Answers
Vanilla gradient descent can be made more reliable using line searches; I've written algorithms that do this, and it makes for a very stable (although not necessarily fast) method.
However, it makes almost no sense to do a line search for stochastic gradient methods. The reason I say this is that if we do a line search based on minimizing the full loss function, we've immediately lost one of the main motivations for doing stochastic methods: we now need to compute the full loss function for each update, which typically has computational cost comparable to computing the full first derivative. Given that we wanted to avoid computing the full gradient because of computational costs, it seems very unlikely that we would be okay with computing the full loss function.
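As a rough back-of-the-envelope illustration of this cost argument (the numbers below are purely illustrative and not from the answer): with $N$ training examples and mini-batches of size $m$, a single line-search probe of the full loss touches all $N$ examples, so it costs roughly as much as $N/m$ mini-batch updates,

$$\frac{\text{one full-loss evaluation}}{\text{one mini-batch step}} \approx \frac{N}{m} = \frac{10^6}{128} \approx 7800 .$$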
Alternatively, you might think of doing something like a line search based on your randomly sampled data point. However, this isn't a good idea either: it will tell you nothing about whether you have stepped too far (which is the main benefit of line searches). For example, suppose you are performing logistic regression. Then each outcome is simply a 0 or 1, and for any single sample we trivially get perfect separation, so the optimal solution for our regression parameters based on that sample of size 1 is trivially $-\infty$ or $\infty$ by the Hauck–Donner effect. That's not good.
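As a minimal sketch of the single-sample problem (my own toy numbers, not part of the original answer): the logistic loss of one observation is monotone in the parameter, so a line search along the gradient direction on that sample alone would happily accept arbitrarily large steps.

```python
import numpy as np

# Logistic loss of a single observation with x = 1, y = 1:
# -log(sigmoid(beta)) = log(1 + exp(-beta))
def single_sample_loss(beta):
    return np.log1p(np.exp(-beta))

# The loss decreases monotonically in beta, so the "optimum" for this one
# sample is beta -> +infinity (perfect separation on a single point):
for beta in [0.0, 1.0, 10.0, 100.0]:
    print(beta, single_sample_loss(beta))
# 0.0 -> 0.693..., 1.0 -> 0.313..., 10.0 -> ~4.5e-05, 100.0 -> ~0.0
# A line search on this sample's loss never detects "stepping too far";
# it would simply keep growing the step.
```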
EDIT
@DeltaIV points out that this also applies to mini-batches, not just individual samples.

- Very nice (+1), but I'm not sure why in the last example you talk about a single sample. I agree that computing the line search based on a mini-batch makes no sense, but a mini-batch still contains 512 samples (usually, and when talking about ImageNet): of course there's no fixed value for the number of samples in a mini-batch, but 1 sample mini-batches feel a bit extreme. Did you use them just to make your point more clear, or am I missing something? – DeltaIV Jan 05 '18 at 09:20
- @DeltaIV: single sample is mostly to make a point about how bad it could be on a very simple problem. If we did mini-batch with 512 samples on logistic regression with 512+ covariates, we would see the same issue. – Cliff AB Jan 05 '18 at 17:08
- @DeltaIV and Cliff - Hmm, exact line search could be very expensive (actual minimization in the direction of the gradient), but **backtracking line search** can be done by testing the loss function value in the direction of the gradient by simply doing a few forward passes (which iterative mini-batch and SGD require anyway). Could not one argue that iterating these two steps **feed forward + backpropagation** would actually be _more_ expensive than testing feed-forward for multiple step sizes? – Josh Jun 01 '20 at 17:23
- @Josh: I may be confused by your question, but the key point is that even evaluating the target objective *once* scales with the sample size, while a single batch does not. So just doing a single check of the loss function is often order**s** of magnitude more expensive than the actual SGD step, meaning you immediately lose the advantage of SGD: fast updates. – Cliff AB Jun 01 '20 at 19:16
- Thanks @CliffAB but doesn't SGD also technically evaluate the loss (on the batch) on each step as the first step in backpropagation to compute the gradient? i.e. I'm not suggesting backtracking line search on the full Loss, but on the Loss on the batch (I know that wouldn't be backtracking line search on the full Loss per se) – Josh Jun 01 '20 at 20:20
- @Josh: ah, okay, you are suggesting computing the loss over the sample. Unfortunately this is not generally a good idea, see the last paragraph of my answer. – Cliff AB Jun 02 '20 at 01:07
- Thanks @CliffAB -- yes, but I don't follow. SGD itself requires _"computing the loss over the sample"_ as well, doesn't it? (i.e. backpropagation requires knowing the error on the sample, i.e. the loss output on the sample). If I'm missing something obvious here please let me know. Happy to follow up on a separate question for further clarification if helpful. Thanks again. – Josh Jun 02 '20 at 01:15
- @Josh: the key is that you *don't* want to optimize the loss over the small batch. This will never converge and will generally provide very bad estimates. On the other hand, using SGD with a small learning rate will get very close to the optimal estimate. – Cliff AB Jun 02 '20 at 02:00
- OK thanks @CliffAB, got it. So just to be clear: while evaluating the loss for the sample is computationally feasible and "on par" in complexity with what it already takes to compute **each SGD step** (correct me if I'm mistaken), what you are saying is that it may not be a good idea since it may not converge and may provide bad estimates. Did I get that right? Just trying to make sure we capture the right reasons in the interest of leaving a good record here for others who may read this and help improve it over time. – Josh Jun 02 '20 at 02:10
The tutorials talk about gradient descent presumably because it is one of the simplest algorithms used for optimization, so it is easy to explain. Since most such tutorials are rather brief, they focus on simple stuff. There are at least several popular optimization algorithms beyond simple gradient descent that are used for deep learning. In fact, people often use algorithms other than plain gradient descent, since they usually converge faster. Some of them have a non-constant learning rate (e.g. one decreasing over time). For a review of such algorithms, you can check the post An overview of gradient descent optimization algorithms by Sebastian Ruder (or the arXived paper).
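As a small illustration of what a non-constant (scheduled) learning rate can look like, here is a minimal sketch of a plain step-decay schedule; the initial rate, decay factor and drop interval are arbitrary choices and this is not taken from Ruder's post:

```python
def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Step decay: multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Example: 0.1 for epochs 0-9, 0.05 for epochs 10-19, 0.025 for epochs 20-29, ...
for epoch in (0, 9, 10, 25):
    print(epoch, step_decay_lr(0.1, epoch))
```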

- Good answer (+1) but it may be even better (bounty?) if you added a discussion of why we still rely so much on SGD, even if Adam, AdaGrad and RMSProp have been around for quite a few years now. – DeltaIV Jan 04 '18 at 18:53
- +1. When you say "adaptive learning rate", are line search methods used? – Haitao Du Jan 04 '18 at 18:54
- @DeltaIV: All the "other" fancy methods are built on top of SGD. The main issue is that the other methods take advantage of local knowledge to make more efficient jumps, rather than just randomly sampling points to compute the gradient on. But SGD is so simple and fast, and it's not completely terrible on its own. – Alex R. Jan 04 '18 at 19:11
- @AlexR. You've got it wrong. The other methods are not simply improvements over SGD. They all generalize *worse* than SGD. Choose any recent architecture and try to train it on ImageNet with Adam (for example). I bet that using SGD I'll get a better generalization error than you, for the same number of iterations, as long as we get to the plateau regime. This is well-known: see [here](https://arxiv.org/abs/1712.07628) for a (basically failed) attempt at overcoming this – DeltaIV Jan 04 '18 at 19:40
- @AlexR. the point is neither that SGD is simple nor that it is fast. Simplicity doesn't matter, since all decent libraries implement SGD, Adam, AdaGrad and RMSProp (and more, sometimes). Speed matters even less, because the time spent by, e.g., Adam, to compute the parameter-level updates is infinitesimal compared to the overall training time of a model like ResNet. The only point is that, for some reason we do not fully understand today, SGD generalizes better than them. So basically if you want to beat SOTA, you're often *forced* to use it, or at least to switch to it later on during training. – DeltaIV Jan 04 '18 at 19:47
- @DeltaIV: I guess we were talking about different things. The fancy SGD methods are unarguably faster in terms of converging to *something* and less prone to getting stuck in local minima. Whether or not the models generalize well is a completely different issue. The point is that vanilla SGD with a fixed-scheme learning rate alone is likely *not* the best solution for these optimizations. The paper you quoted proposes a mixed scheme of Adam and SGD, precisely because Adam is much better initially and because SGD performs better later on. – Alex R. Jan 04 '18 at 20:01
- @DeltaIV: On the subject of why, one really neat explanation is through entropy-based SGD. See this paper: https://arxiv.org/abs/1611.01838.pdf – Alex R. Jan 04 '18 at 20:05
- @AlexR. it's not a completely different issue: in fact, it's an extremely related issue. See all the literature on sharp vs wide minima. And the paper I quoted shows exactly that: by mixing Adam and SGD, they fail at achieving Adam's speed in the initial phase of the optimization, and the lower generalization error of SGD in the final phase (Figure 4) in all but two cases, which are however the only cases where Adam already matched the generalization error of SGD. Thus in these two cases their approach is actually *worse* than Adam, and in all other cases it doesn't match the generalization 1/ – DeltaIV Jan 04 '18 at 21:51
- 2/ error of SGD. That's why the paper basically disappoints, even though it can indicate a direction for future research. I didn't know about entropy-based SGD and I thank you for the link, I'll read it. – DeltaIV Jan 04 '18 at 21:54
- @DeltaIV Very interesting. I opened the paper you linked to, and it references the [Wilson et al 2017](https://arxiv.org/abs/1705.08292) preprint for the claim that SGD generalizes better than Adam etc.; so when you say that it's "well-known", you mean well-known since around half a year, right? – amoeba Jan 04 '18 at 23:47
- Jeez, maybe I'm just old or something...but these papers comparing the out-of-sample error with different optimization algorithms are just a big mess. Let's call the numerical analysts back and see if they can save us. – Cliff AB Jan 05 '18 at 01:49
- @amoeba :-P no, I don't mean that. First of all, the fact that SGD generalized better was known in the community well before Wilson et al's paper. The Adagrad paper is from 2011, Adadelta is from 2012 and Adam is from 2014; ResNet in 2015 and DenseNet in 2016 all use SGD. These are all smart guys who are more up to date on Deep Learning than me (and many others here), whose models each beat SOTA on ImageNet, so why did they all use SGD if these adaptive methods were so good? Secondly, if we really need a paper to date the discovery that SGD generalizes better, that would be 1/ – DeltaIV Jan 05 '18 at 09:12
- 2/ https://arxiv.org/abs/1609.04836, so more than one year ago. But it was "common knowledge" before that...and even today it's proving hard to beat the generalization ability of SGD. See for example https://openreview.net/forum?id=HJjePwx0- which uses a trust region method (which, you will agree with me, is not hot news in the optimization world). They still struggle to prove that their stochastic TR method generalizes better than SGD. SGD is a crappy optimization algorithm, but "unreasonably good" at training DNNs (at least, CNNs) with low generalization error. – DeltaIV Jan 05 '18 at 09:16
- @DeltaIV Thanks. I am not doing much of the deep learning myself, and I was not aware of that at all. Back in 2012 or so when I was watching Hinton's Coursera lectures, he was mainly advocating RMSprop and in recent 1-2 years my impression was that everybody is using Adam (which supersedes RMSprop, according to the Adam paper). When I was [playing with autoencoders](https://stats.stackexchange.com/a/307746/28666) last year, I realized that Adam works much faster than SGD, and since then just assumed that Adam is a default choice nowadays. – amoeba Jan 05 '18 at 10:22
- @amoeba ah, the Hinton course! Excellent choice. I may be wrong but I think Hinton never published a specific paper on RMSProp; the "legend" has it that he basically introduced it in the Coursera course. Adam is indeed much faster than SGD in reducing the error initially, but for some reason it ends up in a final minimum which corresponds to a higher error on the test set than SGD. It's really unfortunate - but I think we'll find something better than SGD by 2020. – DeltaIV Jan 05 '18 at 11:50
- amoeba + DeltaIV: in grad school, my advisor was at one point discussing how early stopping is a (fairly undisciplined) form of regularization; basically the idea being that toward the end of the algorithm, you're just fitting noise. Personally, I haven't seen any convincing arguments in the papers discussed that anything else is going on when the authors claim "SGD results > Adaptive methods results"; SGD is slower, so it's less overfit after 100 epochs. But to be honest, that's based on quick skimming. – Cliff AB Jan 05 '18 at 17:06
- @CliffAB I remember in Andrew Ng's lecture, he mentioned that early stopping is not good, since it mixes a lot of things together (not tuning "hyper parameters using orthogonalization"). Specifically, he thinks we should first reduce the bias of the model (doing well in training), then reduce the variance of the model (try to generalize better in testing). Early stopping mixes these things together and makes debugging more difficult. – Haitao Du Jan 05 '18 at 17:55
- @CliffAB Yes, the relationship between early stopping and regularization can be clearly seen for least squares, where [gradient descent operates in the eigenvalue basis](https://stats.stackexchange.com/a/321617/28666) and small eigenvalues are the last ones to converge; whereas the ridge penalty also penalizes small eigenvalues. I had now only a quick glance into Wilson et al. linked above, but at least in their least squares example the SGD vs Adam difference is *not* explained by early vs late stopping. They claim that they converge to different solutions. – amoeba Jan 05 '18 at 19:48
- @CliffAB I second amoeba's observation that overfitting is not a plausible explanation of the difference in performance. See for example https://arxiv.org/pdf/1609.04836.pdf section 2.2 onwards, where they specifically rule out overfitting as a root cause. Rather, it *seems* that Adam, or SGD with large batches, finds sharper minima. They cannot claim this with certainty because sharpness cannot be measured accurately (it would require a huge computational effort), but the authors use an approximation, so they at least show some plausible evidence. – DeltaIV Jan 06 '18 at 08:43
- Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/71339/discussion-on-answer-by-tim-are-line-search-methods-used-in-deep-learning-why-n). – Tim Jan 08 '18 at 10:51
This question was asked in early 2018, but if you are still waiting for an answer: yes, there are now some implementations of line search in DNNs with good performance. See:
https://arxiv.org/abs/1808.05160 (published in 2 journals)
and more recently, by a different group:
https://arxiv.org/abs/1905.09997 (published in ICML).
It is important to note that, contrary to many claims: first, the theoretical results in the stochastic setting are not strong enough (they hold for convex functions only, while most DNNs are highly nonconvex); and second, implementation in DNNs requires some modifications to suit mini-batch practice (stochastic optimisation, while close, is not the same as mini-batch practice).
P.S. The experiments were done on CIFAR10 and CIFAR100; more resources and personnel are needed to check a larger range of datasets/DNNs.
You can also look at the Wikipedia page:
