Most commonly implemented optimization techniques for non-linear cost functions (that I'm aware of) rely on iteratively applying an additive update to a variable until a minimum is reached or a stopping condition is met. The update is itself a function of the gradient. Are there any alternative methods to this, say methods that rely on a multiplicative (product) update, for example? Is there any result on the 'optimality' of gradient descent, say that it is the best way to get to the optimum?
1 Answer
Are there any alternative methods to this, say those that rely on a product update, for example?
There are a number of alternatives to gradient descent.
- Problems such as OLS minimize a quadratic, so you can solve them directly in closed form (see the sketch after this list). See Do we need gradient descent to find the coefficients of a linear regression model?
- Newton's method is a second-order optimization method; for a quadratic objective, a single Newton step lands exactly on the minimizer (also illustrated in the sketch after this list). See Why is Newton's method not widely used in machine learning?
- Halley's method is a third-order optimization method. See Why not use the third derivative for numerical optimization?
- Non-convex problems have their own elaborate methods which deviate from the pattern of gradient descent. If we think of gradient descent as making iterative, sequential updates within some neighborhood, non-convex optimization methods tend instead to bounce around, either randomly (random search) or in an attempt to make educated guesses. See Optimization when Cost Function Slow to Evaluate and related posts for more discussion.
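Here is a minimal NumPy sketch (the toy problem and variable names are mine, not from the linked posts) contrasting two of the alternatives above on an ordinary least-squares objective: solving the normal equations directly, and taking a single Newton step. Both recover the exact minimizer with no iterative descent at all.

```python
import numpy as np

# Toy least-squares problem: minimize f(w) = 0.5 * ||X w - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# 1) Direct solution via the normal equations: (X^T X) w = X^T y
w_direct = np.linalg.solve(X.T @ X, X.T @ y)

# 2) A single Newton step from w = 0: the Hessian of f is X^T X and the
#    gradient at 0 is -X^T y, so one step w = 0 - H^{-1} g hits the optimum.
hessian = X.T @ X
grad_at_zero = -X.T @ y
w_newton = -np.linalg.solve(hessian, grad_at_zero)

print(np.allclose(w_direct, w_newton))  # True: both recover the same minimizer
```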
Is there any result of 'optimality' of the gradient descent, say that it is the best way to get to the optimum?
"Best" how? A very general theme in numerical optimization is that different methods have different cost-benefit relationships, so choosing the "best" method is about selecting an acceptable trade-off in terms of accuracy/optimality of the result, the amount of time spent to compute the solution, memory consumption, etc.
If your problem is small and the objective is quadratic, gradient descent can be pointlessly slow and inaccurate because you can solve the problem directly.
If your problem is huge, gradient descent might be nice because it doesn't require factorizing large matrices (as QR-based least-squares solvers do) or solving large linear systems (as Newton's method does). Comparatively, gradient descent is cheap. This is a large part of the motivation for using gradient descent to train neural networks.
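To make "cheap" concrete, here is a hedged sketch (same toy problem as the earlier snippet, assuming NumPy) of plain gradient descent on the least-squares objective: each iteration needs only matrix-vector products, no factorization and no linear solve, but it takes many iterations to approach the answer that the direct method gives in one shot.

```python
import numpy as np

# Same toy least-squares problem as in the sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
w_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Plain gradient descent: each iteration costs only matrix-vector products.
w = np.zeros(3)
step = 1.0 / np.linalg.norm(X, ord=2) ** 2  # crude step size 1/L for this quadratic
for _ in range(5000):
    w -= step * (X.T @ (X @ w - y))         # gradient of 0.5 * ||X w - y||^2

print(np.allclose(w, w_direct, atol=1e-6))  # close to the direct answer, but only after many steps
```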
As a matter of pedagogy, gradient descent is easy to explain and understand and is very general. As a matter of practice, gradient descent is really cheap. But depending on the problem you're trying to solve, "better" solutions can exist, where "better" can be taken to mean "less computation time" or "more precise result" etc.
