
I am performing $k$-fold cross-validation for model selection via maximum-likelihood estimation.

At the moment I am using a standard optimizer (fmincon in MATLAB). Since the likelihood landscape is complex, with a myriad of local minima, I need to restart the optimization hundreds or even thousands of times to explore the landscape and find what seems to be the global optimum. For this reason, I have been looking for ways to improve the efficiency of the optimization (see also this post for an orthogonal attempt at obtaining a speed-up).

My strategy so far is to restart the optimizer from different starting points chosen from a space-filling quasi-random Sobol sequence; the logic is much the same as that of a Latin hypercube design, but it is easy to add new points to the sequence. This strategy is dumb in that it treats each restart as independent and does not use any information about the objective function acquired during previous optimizations.
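For concreteness, this baseline looks roughly like the following sketch (in Python/SciPy rather than MATLAB, with the Rosenbrock function as a stand-in for the negative log-likelihood):

```python
import numpy as np
from scipy.stats import qmc
from scipy.optimize import minimize

def negloglik(x):
    # Toy objective (Rosenbrock) standing in for the negative log-likelihood.
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

lb, ub = np.array([-2.0, -2.0]), np.array([2.0, 2.0])

# Space-filling quasi-random starting points: 2^5 = 32 scrambled Sobol points,
# scaled from the unit cube to the parameter bounds.
sampler = qmc.Sobol(d=2, scramble=True, seed=0)
starts = qmc.scale(sampler.random_base2(m=5), lb, ub)

# Independent restarts: keep the best local solution found.
best = min((minimize(negloglik, x0, method="L-BFGS-B",
                     bounds=list(zip(lb, ub))) for x0 in starts),
           key=lambda r: r.fun)
print(best.x, best.fun)
```

Each restart is fully independent here, which is exactly the inefficiency in question.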

I know that there are "semi-smart" multi-start methods for global optimization that choose the starting points in an adaptive way. For example, some multi-start methods somehow estimate the basin of attraction of the function minima and avoid starting from an existing basin. I have skimmed some of the papers but I am not sure how effective these methods would be in practice.
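One concrete example of such a rule is the one in Multi-Level Single-Linkage (MLSL): launch a local search from a sampled point only if no better sampled point lies within a critical radius, on the premise that a better nearby point likely belongs to the same basin of attraction. A minimal sketch in Python/SciPy, with a toy multimodal objective and a fixed radius (in MLSL proper the radius shrinks as the sample grows):

```python
import numpy as np
from scipy.stats import qmc
from scipy.optimize import minimize

def f(x):
    # Toy multimodal objective (hypothetical stand-in for the real target).
    return np.sum(x**2) + 2 * np.sin(5 * x).sum()

# Quasi-random sample of candidate starting points in [-3, 3]^2.
sampler = qmc.Sobol(d=2, scramble=True, seed=1)
pts = qmc.scale(sampler.random(64), [-3, -3], [3, 3])
vals = np.array([f(p) for p in pts])
r = 1.0  # critical radius; a tuning parameter, fixed here for simplicity

minima = []
for i in np.argsort(vals):  # visit candidates from best to worst
    # MLSL-style filter: skip a point if a better sampled point is within r,
    # since both probably sit in the same basin of attraction.
    d = np.linalg.norm(pts - pts[i], axis=1)
    if np.any((d < r) & (vals < vals[i])):
        continue
    minima.append(minimize(f, pts[i], method="L-BFGS-B"))

best = min(minima, key=lambda m: m.fun)
```

The filter typically cuts the number of local searches well below the number of sampled points, which is where the savings over blind restarting come from.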

Does anybody have experience with some kind of multi-start method, and can recommend it?


Just to preempt some answers: I've tried other local and semi-local optimizers, such as CMA-ES, to no avail. I've also tried some global optimizers, such as MCS, with no luck. I am pretty familiar with a variety of local and global optimizers (see my answer to this post), but I am open to new suggestions. Bayesian optimization (see my answer to this post) is not on the table for this problem, though, because BO is way too slow.

lacerbi
  • Just out of curiosity, what is the bottleneck with BO? Too many hyper-parameters? – Sycorax Mar 08 '16 at 17:34
  • The target function is very fast -- each function evaluation for the most complex model takes less than 5 ms, and much less for simpler models. I can easily allocate a budget of $10^6$ function evaluations per dataset and model. I was thinking of using BO just for the multi-start part (which is different than standard BO). I am not aware of any work that does that. I've been thinking a bit about it and that approach would be highly non-trivial and still computationally intensive (e.g., the landscape is very non-stationary, you need to find a way to handle a large number of inputs, etc.). – lacerbi Mar 08 '16 at 17:51
  • Yeah, the $O(n^3)$ GP complexity would seem to bite pretty hard in that context. – Sycorax Mar 08 '16 at 17:54
  • Yep -- unless you have structure to exploit in the kernel (Kronecker, Toeplitz), it's hard to go above $10^3$ data points. You can use inducing-points methods etc. or perhaps some heuristic (clustering, etc.) to keep only a subset of training points, but training of the GP is still going to be slow. I do think that a "Bayesian multi-start" method might be interesting to develop but that would be a project in itself. – lacerbi Mar 08 '16 at 18:03
  • 1
    Hello @lacerbi. Did you find any good multi-start solutions? – Shadi Oct 21 '17 at 08:06
  • 1
    @shadi Not much besides a lot of random or quasi-random starting points, possibly warped to match some known prior over the space of solutions. It's a problem I would like to look into a bit more in the future. – lacerbi Oct 21 '17 at 08:45
  • Just found that optimizing initial weights is not popular: [reference](https://github.com/fchollet/keras/issues/2477) – Shadi Oct 23 '17 at 07:56
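The "warped" quasi-random restarts mentioned in the comments above can be sketched by pushing unit-cube Sobol points through the inverse CDF of a prior over the parameters (here a hypothetical independent Gaussian prior, in Python/SciPy):

```python
import numpy as np
from scipy.stats import qmc, norm

# Scrambled Sobol points in the unit cube [0, 1)^3.
sampler = qmc.Sobol(d=3, scramble=True, seed=0)
u = sampler.random(128)

# Hypothetical Gaussian prior over each of the three parameters.
mu = np.array([0.0, 1.0, -1.0])
sigma = np.array([1.0, 0.5, 2.0])

# Inverse-CDF warp: restarts now concentrate where solutions are
# believed to lie, while retaining the low-discrepancy structure.
starts = norm.ppf(u, loc=mu, scale=sigma)
```

Any prior with an invertible CDF per dimension can be substituted for the Gaussian.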
