
Both methods seem to be used to estimate the maximum of an unknown function, but they clearly go about it in different ways.

But in practice, are the two methods essentially interchangeable? When would I want to use one over the other?

https://en.wikipedia.org/wiki/Simulated_annealing

http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf

Similar question: Bayesian optimization or gradient descent?

canyon289
    I don't think this is sufficiently exhaustive to be an answer, but simulated annealing generally requires a larger number of function evaluations to find a point near the global optimum. On the other hand, Bayesian Optimization is building a model at each iteration but requires relatively few function evaluations. So depending on how expensive the function is to evaluate, you would prefer one to the other because one will have a smaller wall time: Bayesian Optimization in cases where the function is very expensive and annealing when the function is relatively cheap. – Sycorax Feb 19 '16 at 17:52
  • @Sycorax Bumping 10 posts on more or less the same topic in 10 minutes - a bit excessive don't you think? Apparently not, but I do. – Mark L. Stone Oct 01 '16 at 00:47
  • @MarkL.Stone It's more-or-less "slow time," (8pm on a Friday, as of the time of the edits) which is the preferred time to do this. There's a meta thread. – Sycorax Oct 01 '16 at 01:03

1 Answer


Simulated Annealing (SA) is a very simple algorithm in comparison with Bayesian Optimization (BO). Neither method assumes convexity of the cost function, and neither method relies heavily on gradient information.

SA is, in a way, a slightly educated random walk. The candidate solution jumps around the solution space following a particular jump schedule (the cooling parameter). You do not care where you landed before, and you do not know where you will land next. It is a typical Markov chain approach. You do not make any strong assumptions about the underlying solution surface. MCMC optimization has come a long way since SA (see for example Hamiltonian Monte Carlo), but we will not expand on that here. One of the key requirements of SA is that you need to be able to evaluate the function many times and "fast": you want as many samples as possible so you can explore as many states (i.e. candidate solutions) as possible. You use only a tiny bit of gradient-like information (namely, that you almost always accept "better" solutions).
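To make the mechanics concrete, here is a minimal SA sketch for minimization (negate your function if you want a maximum); the step size, cooling rate, and iteration budget below are illustrative choices, not canonical values:

```python
import numpy as np

def simulated_annealing(f, x0, n_iter=10_000, step=0.5, T0=1.0, cooling=0.999):
    """Minimal SA for minimizing f: random-walk proposals + geometric cooling."""
    rng = np.random.default_rng(0)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    T = T0
    for _ in range(n_iter):
        cand = x + step * rng.standard_normal(x.shape)   # blind random-walk jump
        f_cand = f(cand)
        # Always accept improvements; accept worse moves with prob. exp(-delta/T)
        if f_cand < fx or rng.random() < np.exp(-(f_cand - fx) / T):
            x, fx = cand, f_cand
            if fx < best_f:
                best_x, best_f = x, fx
        T *= cooling                                     # cool down the schedule
    return best_x, best_f

# Usage: many cheap evaluations of a toy multimodal function
f = lambda x: np.sum(x**2) + 2.0 * np.sum(np.sin(5.0 * x))
x_opt, f_opt = simulated_annealing(f, x0=np.array([3.0, -2.0]))
```

Note that the loop spends one function evaluation per iteration and keeps no model of the surface; all it remembers is the current point and the best point seen so far.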

Now look at BO. BO (or, simplistically, Gaussian process (GP) regression over your cost-function evaluations) tries to do exactly the opposite in terms of function evaluations: it tries to minimize the number of evaluations you do. It builds a particular non-parametric model (usually a GP) of your cost function, one that often accounts for noise, and it does not use gradient information at all. BO allows you to build an informative model of your cost function with a small number of function evaluations. Afterwards you "query" this fitted function for its extrema. Again, the devil is in the details: you need to sample intelligently (and assume that your prior is at least half-reasonable). There is work on where to evaluate your function next, especially when you know that your function actually evolves slightly over time (e.g. here).
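For contrast, a minimal BO loop might look like the sketch below. It assumes scikit-learn's GaussianProcessRegressor as the surrogate and uses expected improvement as the acquisition; the Matérn kernel, evaluation budget, and 1-D grid search over the acquisition are all illustrative simplifications:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt(f, lo, hi, n_init=5, n_iter=25):
    """Minimal BO for minimizing a 1-D f: GP surrogate + expected improvement."""
    rng = np.random.default_rng(0)
    X = rng.uniform(lo, hi, size=(n_init, 1))            # small initial design
    y = np.array([f(x[0]) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)                                     # refit surrogate model
        grid = np.linspace(lo, hi, 1000).reshape(-1, 1)  # grid search: 1-D only
        mu, sigma = gp.predict(grid, return_std=True)
        best = y.min()
        z = (best - mu) / np.maximum(sigma, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
        x_next = grid[np.argmax(ei)]                     # most promising point
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    return X[np.argmin(y), 0], y.min()

# Usage: few evaluations of a (pretend-)expensive function
f = lambda x: np.sin(3.0 * x) + x**2 - 0.7 * x
x_opt, f_opt = bayes_opt(f, lo=-2.0, hi=2.0)
```

Note how the surrogate is refit after every single evaluation: the per-iteration modelling cost is exactly what buys you the small evaluation budget.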

An obvious advantage of SA over BO is that within SA it is very straightforward to put constraints on your solution space. For example, if you want non-negative solutions, you just confine your proposal distribution to non-negative solutions. The same is not so direct in BO, because even if you evaluate your function only where your constraints hold (say, non-negativity), you will need to actually constrain the fitted process too; this task, while not impossible, is more involved.
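To illustrate, restricting the SA sketch above to non-negative solutions is a one-line change to the proposal; reflecting infeasible proposals (used below) is just one simple choice, and rejecting and redrawing them would work too:

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_nonneg(x, step=0.5):
    """Random-walk proposal reflected back into the non-negative orthant."""
    return np.abs(x + step * rng.standard_normal(x.shape))
```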

In general, one would prefer SA in cases where the cost function is cheap to evaluate and BO in cases where the cost function is expensive to evaluate. I think SA is slowly but steadily falling out of favour; in particular, the work on gradient-free optimization (e.g. NEWUOA, BOBYQA) takes away one of its major advantages in comparison with standard gradient-descent methods, namely not having to evaluate a derivative. Similarly, the work on adaptive MCMC (e.g. see the reference above) renders SA wasteful as a means of MCMC optimization in almost all cases.
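For comparison, derivative-free local optimizers are readily available. SciPy does not ship NEWUOA/BOBYQA itself (those live in packages such as NLopt or Py-BOBYQA), but its Powell method is a derivative-free routine in the same spirit:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: np.sum(x**2) + 2.0 * np.sum(np.sin(5.0 * x))

# Powell's method: local search that never evaluates a derivative
res = minimize(f, x0=np.array([3.0, -2.0]), method="Powell")
print(res.x, res.fun)
```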

usεr11852
  • Thanks for the answer. I see that you're likely right about annealing falling out of favor. scipy deprecated it in favor of basinhopping http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.optimize.anneal.html – canyon289 Feb 20 '16 at 01:30
  • I am glad I could help. Thanks for the tip; I was not aware of that change in SciPy. – usεr11852 Feb 20 '16 at 04:31
  • Unless the constraints are really gnarly, what is the big deal with constraining a GP fit? Of course, when you "query" the fitted function, you perform a constrained optimization. I'm not trying to be sarcastic, I really want to know what difficulties you see. For example, linear equality and inequality constraints should be a piece of cake. If you have non-convex constraints, such as nonlinear equality constraints or integer constraints, those might fall in my gnarly category. – Mark L. Stone Oct 01 '16 at 00:51
  • @MarkL.Stone: Even linear constraints (let alone the *gnarly* ones) can severely affect the fitting in higher dimensions; even if you fit "*something*", I would seriously doubt that this fit would be an accurate representation of what you want. In addition, most continuity-based results behind GPR optimality go out of the window... Just to be clear: I have not used BO extensively, because it always proved suboptimal for the problems I work with. Assuming standard quasi-Newton methods fail, I would always advocate first a derivative-free or an HMC approach. – usεr11852 Oct 02 '16 at 00:20
  • Well, if I have constraints, what I want is for the fitted function to satisfy the constraints. Believe me, I have my doubts how well a GP fit will be an accurate representation of what I want, constraints or not. Good constraints can help you: they constrain things to where they should be and save you from wasting time in bad regions. Of course, that is if they are well implemented. Can you give an example of a continuity-based result behind GPR optimality which goes out the window when there are linear constraints? To be a valid example, it had better have been in the window without constraints. – Mark L. Stone Oct 02 '16 at 01:01
  • "Assuming standard Quasi-Newton method fails" Depends on what "standard" means and why it failed. Not all QN methods and implementations are equal. I can tell you that. if the function is not differentiable in many places, perhaps you should abandon QN, but if you have a robust implementation and function is differentiable, then QN shouldn't fail to find a local optimum (might be a little tough if problem is horribly scaled). If you are using a piece of junk QN implementation, as unfortunately many are, then it may be a different story. As for finding global optimum, that's a different story. – Mark L. Stone Oct 02 '16 at 04:37