
I have a black-box system which accepts a data instance $\boldsymbol{x}$ and outputs a corresponding $y$. I do not know the exact mathematical formula of the system. What I want to do is maximize $y$ by tweaking $\boldsymbol{x}$. Since I don't have the exact formula of the system, I cannot use gradient descent to do this. One idea I have is to add random noise to $\boldsymbol{x}$, which results in $\hat{\boldsymbol{x}}$, and then sample the corresponding $\hat{y}$ from the system. By doing this I could use Bayesian Optimization to find an $\hat{\boldsymbol{x}}$ which maximizes $\hat{y}$.
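For concreteness, here is a rough sketch of the idea, assuming something like scikit-optimize's `gp_minimize`; the `system` function and the bounds below are just placeholders for my real black box and feature ranges:

```python
import numpy as np
from skopt import gp_minimize  # scikit-optimize; just one possible BO implementation

def system(x):
    """Placeholder for the black box: takes a feature vector, returns y."""
    x = np.asarray(x)
    return -np.sum((x - 0.3) ** 2)  # stand-in; the real formula is unknown

# One (low, high) pair per feature; placeholder ranges.
bounds = [(-1.0, 1.0)] * 5

# gp_minimize minimizes, so negate the output to maximize y.
result = gp_minimize(lambda x: -system(x), bounds, n_calls=30, random_state=0)
print("best x:", result.x, "best y:", -result.fun)
```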

My question is: Are there any other methods to tackle this problem? Can you point me to any papers/books/tutorials?

Thank you

MarsPlus
  • Some of the content of this thread will apply (admittedly most of it is about BO). http://stats.stackexchange.com/questions/193306/optimization-when-cost-function-slow-to-evaluate/193310#193310 – Sycorax Mar 14 '17 at 17:02
  • @Sycorax Thank you very much! I will look at that answer. – MarsPlus Mar 14 '17 at 17:03
  • Also, the dirt-simple approach of random search has some nice properties. I can't find the specific paper that compares it to BO, but such a paper does exist. If I find it, I'll post an answer. – Sycorax Mar 14 '17 at 17:09
  • @Tavrock Thanks for your comment. Can you clarify what a uniform set of $x$ is? Does that mean I should add white Gaussian noise to generate such $x$? – MarsPlus Mar 14 '17 at 17:11
  • @Tavrock That's essentially what BO does, but it tries to choose the points $x$ a little more efficiently than just a uniform grid, since black box evaluations are often expensive. – Danica Mar 14 '17 at 17:25
  • Oh I see. I am sorry I forgot to mention that the input $\boldsymbol{x}$ is not a scalar. It is a vector of real-valued features. – MarsPlus Mar 14 '17 at 17:25
  • Knowing that you are dealing with $Y=f(\vec{x})$ versus $Y=f(x)$ is a huge difference. What is the size of your input vector? – Tavrock Mar 14 '17 at 17:37
  • Oh, duh! And PSO can compare nicely to BO. See: http://stats.stackexchange.com/questions/194056/advantages-of-particle-swarm-optimization-over-bayesian-optimization-for-hyperpa?rq=1 – Sycorax Mar 14 '17 at 17:45
  • @Tavrock In my case, the size of the input vector generally does not exceed 20. – MarsPlus Mar 14 '17 at 17:49
  • @Sycorax Thanks for the link. I will have a look at that. – MarsPlus Mar 14 '17 at 17:49

1 Answer


Derivative-free optimization methods solve this type of problem, where you can view your objective function as a black box. Bayesian optimization is one such derivative-free method. It helps if you know any other structural information about your objective function, e.g., can you assume convexity? Surrogate models can be used to estimate an approximate gradient, e.g., take a few points, fit a quadratic function, and use the model to do a local search in the direction of its negative gradient. There is a vast literature on derivative-free methods (for a review of algorithms, see http://thales.cheme.cmu.edu/dfo/comparison/dfo.pdf). Another example of a derivative-free method is Nelder-Mead: https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method .
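A minimal sketch of that surrogate idea (the step size, sampling radius, and diagonal-quadratic model here are arbitrary illustrative choices, not part of any standard algorithm):

```python
import numpy as np

def surrogate_gradient_step(f, x, step=0.1, radius=0.05, n_samples=30, rng=None):
    """One local-search step for maximizing the black box f using a cheap
    quadratic surrogate (constant + linear + diagonal quadratic terms)."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    # Query the black box at a few points near the current iterate.
    X = x + radius * rng.standard_normal((n_samples, d))
    y = np.array([f(xi) for xi in X])
    # Least-squares fit of y ~ c + b·x + a·x² (elementwise squares).
    Phi = np.hstack([np.ones((n_samples, 1)), X, X**2])
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    b, a = coef[1:1 + d], coef[1 + d:]
    grad = b + 2.0 * a * x  # gradient of the surrogate at x
    # Step uphill on the surrogate (i.e., along the negative gradient of -f).
    return x + step * grad / (np.linalg.norm(grad) + 1e-12)
```

Nelder-Mead is available off the shelf, e.g. SciPy's `minimize(lambda x: -f(x), x0, method="Nelder-Mead")` if you are maximizing `f`.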

Of course, because very little information is known about the objective function, it is difficult to solve problems with a high input dimension (compared to regular optimization problems) -- you would need to consider the number of function evaluations you can afford and the size of the input vector you're optimizing.

jagdish