
In simulated annealing, as I understand it, the algorithm stochastically searches the whole landscape at the beginning, looking for the global minimum, and then gradually hones in on the best solution it can find. With neural networks, how can anything similar happen when the weights are set only once, at random? With gradient descent, the weights fall straight into the nearest local minimum. How does the network get out of it and find a better solution?
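
For reference, here is a minimal sketch of the simulated-annealing loop described above (not part of the original question); the 1-D objective, Gaussian proposal, and geometric cooling schedule are illustrative assumptions. Worse moves are accepted with probability exp(-delta/T), which shrinks as the temperature T cools, so the search is broad early on and increasingly greedy later.

    import math
    import random

    def simulated_annealing(f, x0, n_steps=10_000, t0=1.0, cooling=0.999):
        """Minimize f starting from x0 with a simple geometric cooling schedule."""
        x, fx = x0, f(x0)
        best_x, best_fx = x, fx
        t = t0
        for _ in range(n_steps):
            # Propose a random neighbour; the step size here is arbitrary.
            x_new = x + random.gauss(0.0, 0.5)
            fx_new = f(x_new)
            delta = fx_new - fx
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t), which shrinks as t cools.
            if delta < 0 or random.random() < math.exp(-delta / t):
                x, fx = x_new, fx_new
                if fx < best_fx:
                    best_x, best_fx = x, fx
            t *= cooling
        return best_x, best_fx

    # Toy objective with many local minima.
    print(simulated_annealing(lambda x: x**2 + 3 * math.sin(5 * x), x0=4.0))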

DifferentialPleiometry
snackbob
  • I don't understand your premise. The standard approach to training a neural network (as opposed to a regression model) is backprop and gradient descent. This is explicitly a local optimization, and there is no guarantee that the solution is a global minimum. As a practical matter, these solutions are usually "good enough." The most relevant contrast to simulated annealing is that modern neural nets often have millions of parameters; I don't believe simulated annealing works well in a search space that large (but I might be wrong -- not a SA specialist). – Sycorax Apr 29 '20 at 16:29
  • All global optima are also local optima, but not all local optima are global optima. In optimization we're looking for local optima that are also global optima. We often use greedy optimization methods, which do not guarantee that the local optima found by the algorithm are also global optima. – DifferentialPleiometry Apr 29 '20 at 16:33
  • I suspect that the vanilla setup of a multilayer perceptron is more prone to getting stuck in a local minimum than simulated annealing (under random mutations to the solution). – DifferentialPleiometry Apr 29 '20 at 16:36
  • There are many adjustments made to neural networks these days, including a paper I saw a while back in which simulated annealing was used in the training of a neural network. Simulated annealing is a metaheuristic that can be combined with other optimization techniques, and many choices can be made in how a neural network is trained (some of which may trade efficient convergence for less greediness). – DifferentialPleiometry Apr 29 '20 at 16:39
  • @Sycorax I can see your point about how, with a high-dimensional parameter space, simulated annealing may not be very useful in some circumstances because the landscape is so huge. What I mean by my premise is that you start at a point given by the random weights, and from wherever it starts, gradient descent just moves down into whatever local minimum (bowl) it is in, where it can't escape. I can see how something like momentum can help (see the sketch after these comments), but that is more of an add-on rather than how a NN works fundamentally. That is what is confusing me. – snackbob Apr 29 '20 at 16:46
  • @Galen Do you mean that, instead of random weights at the beginning, simulated annealing gives the network a better head start? – snackbob Apr 29 '20 at 16:51
  • @snackbob I think you're starting from the premise that neural networks must achieve a global minimum to be "good." But it happens to be the case that local minima are often "good enough," in the sense that the resulting models are very effective. Understanding why neural networks work this way, and characterizing just how different local and global optima are for NNs and other modern ML models, is an active area of research, so my summary here may be glossing over some important findings. – Sycorax Apr 29 '20 at 16:59
  • @snackbob No, not at all. – DifferentialPleiometry Apr 29 '20 at 16:59
  • The bottom line is that neural networks are not inherently better at avoiding local-only optima compared to simulated annealing. – DifferentialPleiometry Apr 29 '20 at 17:02
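
To make the momentum point in the comments concrete, here is a minimal sketch (not from the original thread) of heavy-ball gradient descent on a toy 1-D non-convex objective; the function, learning rate, and momentum coefficient are illustrative assumptions. The accumulated velocity can carry the iterate past shallow bumps that would stop plain gradient descent, but it remains a local method with no global guarantee.

    import numpy as np

    def f(x):
        # Toy non-convex objective: a parabola with ripples on top.
        return x**2 + 3 * np.sin(5 * x)

    def grad_f(x):
        return 2 * x + 15 * np.cos(5 * x)

    def gd_momentum(x0, lr=0.002, beta=0.9, n_steps=2000):
        """Gradient descent with a heavy-ball momentum (velocity) term."""
        x, v = x0, 0.0
        for _ in range(n_steps):
            v = beta * v - lr * grad_f(x)  # accumulate a decaying velocity
            x = x + v                      # velocity can roll past shallow bumps
        return x

    x_min = gd_momentum(x0=4.0)
    print(x_min, f(x_min))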

1 Answer


It's true that if a neural network is trained with plain (full-batch) gradient descent, it is only guaranteed to find the global minimum for convex loss functions. To address this, most neural networks use some variant of Stochastic Gradient Descent, which introduces noise by computing each update from only a subset of the data, so the algorithm can jump out of local minima. However, it's worth noting that there is still no guarantee of finding the global optimum; SGD is usually preferred over simulated annealing because it is (typically) faster.
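
As a rough illustration of the noise argument in this answer (a sketch under assumed details, not code from the answer itself), the mini-batch SGD loop below computes the gradient on a small random batch at each step; that batch gradient is a noisy estimate of the full gradient, and the noise is what can kick the iterate out of shallow basins. The least-squares problem used here is convex, so there is nothing to escape; it only shows where the stochasticity enters.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic linear-regression data; the model and loss are just for illustration.
    X = rng.normal(size=(1000, 5))
    true_w = rng.normal(size=5)
    y = X @ true_w + 0.1 * rng.normal(size=1000)

    def sgd(X, y, lr=0.05, batch_size=32, n_epochs=20):
        """Mini-batch SGD on the mean-squared-error loss."""
        w = np.zeros(X.shape[1])
        n = len(y)
        for _ in range(n_epochs):
            order = rng.permutation(n)
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                Xb, yb = X[idx], y[idx]
                # Gradient of the batch MSE: a noisy estimate of the full gradient.
                grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
                w -= lr * grad
        return w

    w_hat = sgd(X, y)
    print(np.linalg.norm(w_hat - true_w))  # should be small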

Jspang
  • Why, then, have neural networks been shown to be so good at optimization compared to other methods? If that is the case, surely other methods that actively try to escape local minima, e.g. genetic algorithms or simulated annealing, would hone in on a better solution purely by chance. With my current understanding this defies logic, or at least it shows there is huge scope and potential for better methods to come. – snackbob Apr 29 '20 at 18:09
  • 2
    The neural network itself is really just a model (similar to how linear regression is a model) that uses an optimization algorithm to minimize some loss function. These algorithms you have suggested could indeed be used in conjunction with a neural network instead of using stochastic gradient descent and this has been tried in the past; however, there are issues associated with most of these algorithms that have favored stochastic gradient descent variations. Specifically, with simulated annealing, the space is too large to perform this in a reasonable amount of time. – Jspang Apr 29 '20 at 18:34
  • 1
    I would recommend checking out this post: https://stats.stackexchange.com/questions/235862/is-it-possible-to-train-a-neural-network-without-backpropagation – Jspang Apr 29 '20 at 18:34