
I'm giving Bayesian Optimization a go, following Snoek, Larochelle, and Adams [http://arxiv.org/pdf/1206.2944.pdf], using GPML [http://www.gaussianprocess.org/gpml/code/matlab/doc/]. I've implemented the Expected Improvement (EI) acquisition function described on page 3, and I assume I'm correct that, to decide where to query my objective next, I should take the $\mathbf{x}$ that maximizes:

$a_{\mathrm{EI}}(\mathbf{x}; \{\mathbf{x}_n, y_n\}, \theta)$

But I can't seem to find guidance on what set of candidate $\mathbf{x}$'s to consider. Theoretically, I'd like to find the best $\mathbf{x}$ over the entire domain, and the paper is written in such a way that seems to suggest this is possible ("[EI] also has closed form under the Gaussian process"). But as a practical matter, I need to calculate the posterior predictive means and variances at any $\mathbf{x}^*$ I might consider before I can calculate $a_{\mathrm{EI}}(\mathbf{x}^*)$, and while these posteriors have a closed form, I still need to calculate them using matrix algebra, so I can't see a way to get around picking a bunch of $\mathbf{x}$'s.
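
For reference, here is roughly what I compute once I do have a set of candidates (a minimal sketch for minimization, assuming fmu and fs2 are the latent posterior means and variances at the candidate points, e.g. the corresponding outputs of GPML's gp function, and fmin is the best objective value observed so far):

```matlab
% Sketch: closed-form EI (for minimization) from the GP posterior at candidate points.
% fmu, fs2: column vectors of posterior predictive latent means and variances
%           (e.g. the fmu and fs2 outputs of GPML's gp function).
% fmin:     best (lowest) objective value observed so far.
s   = sqrt(max(fs2, 1e-12));            % posterior std, guarded against zero variance
z   = (fmin - fmu) ./ s;                % standardized improvement
Phi = 0.5 * erfc(-z / sqrt(2));         % standard normal CDF
phi = exp(-0.5 * z.^2) / sqrt(2*pi);    % standard normal PDF
ei  = s .* (z .* Phi + phi);            % expected improvement at each candidate
[~, best] = max(ei);                    % index of the candidate to query next
```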

The question: what is a practical method for choosing the large (medium? small?) set of candidate $\mathbf{x}$'s over which I maximize EI (or any other acquisition function)? (Is this in the paper somewhere and I just missed it?)

At the moment, I'm just taking my current set $\{\mathbf{x}_i\}$, sampling it with replacement 2000 times, and then adding some Gaussian noise to each point. Seems okay, I guess.
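
Concretely, that candidate generation is something like the following sketch (X here stands for my n-by-d matrix of evaluated points, and the noise scale 0.1 is arbitrary):

```matlab
% Sketch of my current candidate generation: resample the evaluated points
% with replacement and jitter them with Gaussian noise.
nCand = 2000;
idx   = randi(size(X, 1), nCand, 1);                 % rows sampled with replacement
Xcand = X(idx, :) + 0.1 * randn(nCand, size(X, 2));  % Gaussian jitter (scale arbitrary)
```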

  • I haven't read this paper yet, but have you considered sampling from the domain of interest using a Latin Hypercube Design? https://en.wikipedia.org/wiki/Latin_hypercube_sampling – RustyStatistician Feb 05 '16 at 17:14
  • An alternative would be to grid the domain (if it's possible) and then evaluate every point on that gridded domain. – RustyStatistician Feb 05 '16 at 17:15
  • These are both sensible suggestions, thanks! Don't know much about Latin hypercubes, but regular gridded domains would mean Toeplitz and Kronecker algebra could be used, which would make things nice and efficient even with a very large grid. – stackoverflax Feb 06 '16 at 17:36

1 Answer


The norm is to use any global optimizer you like. The problem is that the EI surface is highly multi-modal and disconnected; optimizing this acquisition function is a nontrivial problem in itself.

A common choice that I have seen in various papers is the DIRECT algorithm; sometimes I've seen CMA-ES, which is a state-of-the-art method in nonlinear optimization. In my experience with other forms of optimization, MCS (Multi-Level Coordinate Search) tends to work relatively well. You can find a review of derivative-free global optimizers here:

  • Rios and Sahinidis, "Derivative-free optimization: a review of algorithms and comparison of software implementations", Journal of Global Optimization (2013).

By the way, EI is analytical, so if you want you can also compute its gradient to guide the optimization, but this is not necessary. An effective technique is to run a global optimizer first to find promising solutions and then run a local optimizer to refine the best one(s) (e.g., a quasi-Newton method such as BFGS, that is, fminunc in MATLAB; or fmincon if you have constraints).

Finally, if the speed of optimizing the acquisition function is a factor (which is not the "traditional" BO scenario), I have found decent results by starting with a Latin Hypercube design or a quasi-random Sobol sequence design, then refining with a few steps of a local optimizer from the best point(s); see also @user777's comment. Since this is not the standard BO scenario, I don't have any specific reference that actually uses this method.
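
For concreteness, this cheap pipeline looks roughly like the sketch below. The names are only illustrative: lb and ub are assumed to be 1-by-d bounds of a box-constrained domain, and negEI is assumed to be a function handle that returns minus the acquisition value at a point (so that minimizing it maximizes EI); lhsdesign is from the Statistics Toolbox and fmincon from the Optimization Toolbox.

```matlab
% Sketch: space-filling design, pick the best candidate by acquisition value,
% then refine it locally with a constrained local optimizer.
% Assumptions: lb, ub are 1-by-d domain bounds; negEI(x) returns minus the EI at x.
d     = numel(lb);
nCand = 2000;
U     = lhsdesign(nCand, d);                            % Latin Hypercube in [0,1]^d
Xcand = bsxfun(@plus, lb, bsxfun(@times, U, ub - lb));  % scale the design to the box
vals  = zeros(nCand, 1);
for i = 1:nCand
    vals(i) = negEI(Xcand(i, :));                       % evaluate the acquisition on the design
end
[~, i0] = min(vals);                                    % best point of the design
opts  = optimoptions('fmincon', 'Display', 'off');
xnext = fmincon(negEI, Xcand(i0, :), [], [], [], [], lb, ub, [], opts);  % local refinement
```

A Sobol sequence (sobolset in the Statistics Toolbox) can be swapped in for lhsdesign, and fminunc can replace fmincon if the domain is unconstrained.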


Examples of papers that refer to DIRECT or CMA-ES:

  • Calandra, R., Seyfarth, A., Peters, J., & Deisenroth, M. P. (2015). Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 1-19.
  • Mahendran, N., Wang, Z., Hamze, F., & de Freitas, N. (2012). Adaptive MCMC with Bayesian optimization. In International Conference on Artificial Intelligence and Statistics (pp. 751-760).
  • Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., & Roberts, S. J. (2014). Sampling for inference in probabilistic models with fast Bayesian quadrature. In Advances in Neural Information Processing Systems (pp. 2789-2797).

You can just google "Bayesian optimization" + the desired global optimization algorithm, and you'll find a bunch of papers. Plus, in pretty much every other paper about BO you would find a sentence such as:

[...] BO usually requires an auxiliary global optimizer in each iteration to optimize the acquisition function. It is customary in the BO literature to use DIvided RECTangles (DIRECT) to accomplish such a task. Other global optimization algorithms like CMA-ES could also be applied.

lacerbi
  • This is actually kind of surprising to me! Can you point me to a representative Bayesian optimization paper that you have in mind that uses DIRECT or CMA-ES? Thanks. – stackoverflax Mar 02 '16 at 12:35
  • Why is it surprising? This is the norm -- you'll find references to DIRECT or other global optimizers in pretty much every BO paper. It is probably so well known in the community that a few papers don't even bother to mention it; it's just taken for granted. I added a few references in my main comment above. – lacerbi Mar 02 '16 at 14:27
  • This is not necessarily a *good* solution, but I've found it can be cheaper to just evaluate EI at a set of points sampled using Latin Hypercubes in cases where you only need to be near the minimum but not necessarily on top of it. – Sycorax Mar 02 '16 at 14:56
  • @user777: Yes, if speed is at stake I've been using both LH and Sobol quasi-random sequences as initial design (finding a _slight_ advantage with the latter, but it might be problem-dependent), and then running a local optimizer such as BFGS from the best point(s). I will add this to the main comment. – lacerbi Mar 02 '16 at 15:04
  • One way to justify the ad hoc nature of the LHS approach is that finding the minimum of a stochastic function (surface) is unnecessary because the error in the estimation of the minimum will swamp minute gains in precision. This is a very good answer, though. I'm glad someone else here cares about BO. :-) – Sycorax Mar 02 '16 at 17:12
  • @user777: Thanks -- although I disagree with your statement. :) What you say might hold in some cases but in my somewhat limited experience (I started using BO less than a year ago) I did observe substantial difference in the efficiency of BO between running a sloppy global optimization of the acquisition function and a "proper" global optimization. Of course the latter is much more time consuming, so the actual wall-clock gain depends on the cost of each function evaluation; that's the only argument in favour of sloppy optimization that I think can be relevant here. – lacerbi Mar 02 '16 at 17:22
  • @lacerbi I think that we're talking about two different efficiencies, though. For example, if the true minimum is 0 in expectation but has standard error 1, reporting that the minimum is 0.1 in expectation at a nearby point, with a standard error of 1, both includes the minimum in any reasonable confidence set and saves the headache of trying to precisely determine which point, if any, has the lower expectation. – Sycorax Mar 02 '16 at 17:26
  • @user777: That to me sounds like the problem of reporting the optimum, which I think is a different problem (which most BO algorithms do not really approach, by the way). When you are running BO, you would do better by using all the information in your posterior + acquisition function (or change them, if you don't already trust them!). Of course if the sloppy optimization (e.g., LHS) happens to always be near the global optimum of the acquisition function, then well, it's not really sloppy. But if you sample your acquisition function really sub-optimally, expect the BO to do worse. – lacerbi Mar 02 '16 at 17:37
  • (Surprising just because I didn't see it mentioned in the paper I cited nor do I remember Nando mentioning it during a tutorial a few years ago...details, details.) – stackoverflax Mar 03 '16 at 12:46
  • Thanks for the illuminating discussion. One more practical question: for the Latin Hypercube approach, I just do a one-shot prediction, end up with a bunch of EI values, and then take the argmax. For DIRECT / CMA-ES I suppose I just need a function that calculates EI at a new location, and then I run the global optimizer on that function? – stackoverflax Mar 03 '16 at 12:49
  • Yes, that's correct. – lacerbi Mar 03 '16 at 14:52