Does anyone know why "Gradient Descent" algorithms are considered the "go-to" choice of optimization algorithm for neural networks?
In particular, I am interested in knowing why "Random Search" is not the "default" choice of optimization algorithm for neural networks (and other statistical models). I have seen proofs that show the convergence of gradient descent algorithms.
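For example, a standard result of the kind I mean (for a convex, $L$-smooth function $f$ minimized by gradient descent with constant step size $1/L$) is
$$ f(x_k) - f(x^*) \le \frac{L \lVert x_0 - x^* \rVert^2}{2k}, $$
i.e. the suboptimality gap shrinks at a rate of $O(1/k)$ in the number of iterations $k$.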
But I am interested in knowing whether there are any theoretical reasons which suggest that "it makes more sense to use gradient-based methods than random search."
I am aware that random search might have some advantages over gradient descent when the loss function is non-differentiable, or when its gradient is noisy or too expensive to compute - but at least in the theoretical setting where the loss function is convex and fully observable (unlike most statistics/machine learning applications, where we only have realizations of the loss function and not access to the actual loss function) - are there any theoretical results which show that gradient descent is "better" (e.g. stronger convergence in fewer iterations) than random search?
At its core, "random search optimization" seems inefficient at optimizing complex functions (I agree that random search might not get stuck in local minima the way gradient descent can, but this alone does not seem like a reason to favor random search over gradient descent in classic MLP neural networks). My reasoning is that the random search algorithm has no ability to "exploit" successful results it stumbled across in past iterations - each new iteration is independent of all previous iterations. This is in contrast with gradient-based algorithms, which can exploit successful results from previous iterations (a small sketch of this contrast is given below).
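To make this concrete, here is a minimal toy sketch of my own (the function f(x) = (x - 3)^2, the starting point -100, and the step size 0.1 are arbitrary choices, not taken from any package), contrasting how gradient descent builds each iterate from the previous one while pure random search draws every candidate independently:

# toy comparison on a convex function f(x) = (x - 3)^2, whose minimum is at x = 3
f      <- function(x) (x - 3)^2
grad_f <- function(x) 2 * (x - 3)

set.seed(1)
n_iter <- 50
step   <- 0.1

# gradient descent: each iterate is built from ("exploits") the previous one
x_gd <- numeric(n_iter)
x_gd[1] <- -100
for (k in 2:n_iter) {
  x_gd[k] <- x_gd[k - 1] - step * grad_f(x_gd[k - 1])
}

# pure random search: each candidate is drawn independently of all earlier ones
x_rs <- runif(n_iter, min = -100, max = 100)

# best objective value among the 50 points visited by each method
min(f(x_gd))   # typically very close to 0 (the true minimum value)
min(f(x_rs))   # typically much larger after the same number of points

After the same number of visited points, gradient descent is usually far closer to the minimizer, precisely because each step builds on the previous one.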
This can also be seen when comparing random search against a gradient-based algorithm (e.g. Newton-Raphson) - I demonstrate this by running both algorithms on a simple function, f(x) = x^3 - 2x - 5, using the R programming language:
# define function
func2 <- function(x) {
  x^3 - 2*x - 5
}
#newton-raphson definition taken from here: https://rstudio-pubs-static.s3.amazonaws.com/205225_5991ce162f504576a84eac9c659aeaaf.html
#result of newton-raphson (imagine we have no idea what "func2" looks like, so we decide to search over the range -100 to 100)
newton.raphson(func2, -100, 100)
$`root approximation`
[1] 2.094551
$iterations
[1] -66.6709447 -44.4535887 -29.6448834 -19.7763614 -13.2024841 -8.8258505 -5.9131113 -3.9701043 -2.6532168 -1.6923217
[11] -0.7120084 -8.9288270 -5.9816657 -4.0160240 -2.6849961 -1.7176501 -0.7495646 -13.2218225 -8.8387212 -5.9216799
[21] -3.9758448 -2.6571927 -1.6955013 -0.7167986 -9.2966057 -6.2264819 -4.1798748 -2.7979524 -1.8062203 -0.8713348
[31] 13.2419211 8.8711801 5.9860069 4.1137209 2.9574769 2.3405992 2.1229681 2.0949959 2.0945516 2.0945515
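(The newton.raphson function itself is defined at the linked page rather than reproduced here; purely to make the post self-contained, a minimal sketch with the same interface could look roughly like the following - this is my own version, not the linked implementation, so the exact iterates may differ slightly.)

# minimal Newton-Raphson root finder with the same interface as the call above
# (starts from `a`; `b` is kept only for interface compatibility and is unused;
#  the derivative is approximated by a central difference)
newton.raphson <- function(f, a, b, tol = 1e-5, n = 1000) {
  x <- a
  iterates <- numeric(0)
  for (i in seq_len(n)) {
    h  <- 1e-6
    df <- (f(x + h) - f(x - h)) / (2 * h)  # numerical derivative f'(x)
    x  <- x - f(x) / df                    # Newton update: x - f(x)/f'(x)
    iterates <- c(iterates, x)
    if (abs(f(x)) < tol) break             # stop once f(x) is close enough to 0
  }
  list(`root approximation` = x, iterations = iterates)
}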
#compare to random search
library(randomsearch)
#(search from -100 to 100)
res = randomsearch(func2, lower = -100, upper = 100, minimize = TRUE, max.evals = 100)
rs = summary(res)
#view results of random search
rs
Randomsearch Result:
Eval no.: 88
x: x=-98.7
y: -960425.4
As we can see, the gradient-based algorithm converged to its answer (x ≈ 2.0946) in around 40 iterations, whereas random search used 100 evaluations and still returned a value nowhere close to it (I repeated the random search several times and consistently observed this).
Thus, in the end: are there any theoretical results (e.g. relating to rates or strength of convergence) which specifically show why random search algorithms are typically outperformed by gradient-based algorithms (and why the latter are therefore preferred) for optimizing loss functions, to the point that gradient-based methods have become the default choice of optimization algorithm in classic MLP neural networks? Are there any theoretical results which demonstrate this at least for "simple", fully observable functions (like the f(x) = x^3 - 2x - 5 example above), or for convex ones?
Thanks!