11

This is a tough topic for me to google, since having the words optimization and stochastic in a search almost automatically defaults to results on stochastic optimization. But what I really want to know is: what methods exist for optimizing computer models when the computer model output is stochastic, i.e., not deterministic?

For example, consider a computer model whose output is represented by some unknown function $f(x)$. There exist many statistical methods for solving problems like

\begin{align*} \min&\,\,\,\, f(x)\\ x&\in\mathcal{X} \end{align*}

when $f(x)$ is deterministic. But what happens when $f(x)$ is stochastic? Is there a solution to the problem, or at best can we only solve

\begin{align*} \min&\,\,\,\, \mathbb{E}[f(x)]\\ x&\in\mathcal{X} \end{align*}

where $\mathbb{E}[\cdot]$ is the usual expectation operator?
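To make the setting concrete, here is a minimal sketch (Python, with a made-up quadratic simulator standing in for the real computer model) of what I mean: a single run of the model at a fixed input $x$ returns a different value each time, so a target such as $\mathbb{E}[f(x)]$ can only be estimated, e.g. by averaging replicate runs.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(x):
    """Hypothetical stochastic computer model: the underlying mean response
    is (x - 2)^2, but every run adds independent noise, so two calls with
    the same x return different outputs."""
    return (x - 2.0) ** 2 + rng.normal(scale=0.5)

def mean_response(x, n_rep=100):
    """Estimate E[f(x)] by averaging repeated runs at the same input."""
    return np.mean([simulator(x) for _ in range(n_rep)])

# A single run is noisy; the replicate average is a usable estimate of E[f(x)].
print(simulator(2.0), mean_response(2.0))
```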

RustyStatistician
  • This is a very interesting question. Optimization of $E[f(x)]$ is the only thing that will really be possible. A statistical application related to this question is the MCEM algorithm, where the full likelihood function is only observable with MCMC error on top of it. Similarly, MCMC particle filter algorithms have the same problem. I haven't read up enough on either literature to know what the state-of-the-art methods for answering this are, though. – Cliff AB Apr 08 '16 at 17:03
  • It depends on your goal. $\mathbb{E}[f(x)]$ is only one of many possible choices. In some applications you might want to have a "reliable" solution, not just one which is "good on average". In this scenario you would optimize with respect to some quantile of the distribution of $f(x)$. [Bayesian optimization](https://en.wikipedia.org/wiki/Bayesian_optimization) deals with costly (and sometimes noisy) function evaluations. Check for example [this question](https://stats.stackexchange.com/questions/193306/optimization-when-cost-function-slow-to-evaluate/193391#193391). – lacerbi Apr 08 '16 at 19:05
  • @lacerbi are any of those examples noisy? I think they are only deterministic. – RustyStatistician Apr 09 '16 at 06:28
  • @RustyStatistician: you're right, most examples are deterministic or talk about Bayesian optimization in general. See below for references more focussed on the "noisy" part. – lacerbi Apr 09 '16 at 14:50
  • Do you have access to the computer program, so you can run it yourself for chosen inputs $x$? Then methods for design of experiments become available for use! Search this site. – kjetil b halvorsen Apr 09 '16 at 19:43
  • @kjetilbhalvorsen yes I do. Although I am not sure how that helps me.... – RustyStatistician Apr 11 '16 at 13:10
  • Search for "response surface", maybe. – kjetil b halvorsen Apr 11 '16 at 14:01
  • Are you talking about "shooting method" or its smarter variants (PSOt,...), model adaptive control, or similar? Can you assert that inputs are bounded/finite? – EngrStudent Sep 30 '16 at 19:09
  • @EngrStudent I don't think I am asserting any methods (I am trying to figure out what methods exist!), but yes I can assert that the inputs are bounded and finite. – RustyStatistician Sep 30 '16 at 19:36
  • If you provide more details of a specific or representative problem you want to solve, you might get more focused advice/answers. How many optimization variables (dimension)? Are variables continuous? What constraints? Are constraints deterministic? Is objective function differentiable (smooth) everywhere, or at least most places? Can you evaluate E(gradient of f(x))? For instance, by applying automatic or symbolic or complex step differentiation to your program? If so, you may be able to use a (noisy) gradient-based method. Can you use common random numbers for different points and get positive correlation? – Mark L. Stone Oct 01 '16 at 00:11
  • Note that when I wrote "Is objective function differentiable (smooth) everywhere, or at least most places?" I'm referring to the unknown actual expected value, not the noisy evaluation. I know you can only evaluate E(f(x)), and perhaps E(gradient of f(x)), with noise, but the properties of the underlying truth affect how you do the optimization. If you can get high positive correlation of error in noisy evaluations of E(f(x)) for nearby points, that can help a lot. Similarly for E(gradient of f(x)). It might be good to use the same random numbers for objective function and gradient evaluation if possible. – Mark L. Stone Oct 01 '16 at 00:21
  • You may get bonus points if you can do a noisy Hessian evaluation, but I don't think it's common that people would be able to do that. But if you can, it might help things. – Mark L. Stone Oct 01 '16 at 00:25
  • @MarkL.Stone there is no application yet. I am approaching this question from a purely methodological point of view. So my assumptions are pretty open. – RustyStatistician Oct 03 '16 at 15:40

3 Answers

10

(Expanding my comment to a proper answer.)

As I mentioned, it depends on your goal.

The expected value $\mathbb{E}[f(x)]$ is only one of many possible choices for the optimization target. For example, assuming that the $f(x)$ are normally distributed, you could do:

$$ x^\text{opt} = \arg \min_x \left\{ \mathbb{E}[f(x)] + \kappa \sqrt{\operatorname{Var}[f(x)]} \right\} $$ for some $\kappa \in \mathbb{R}$ that controls risk sensitivity. If $\kappa > 0$ you are looking for a robust solution: points with large positive fluctuations are penalized, so the chosen $x$ is likely to remain good. Vice versa, a negative $\kappa$ favours an "optimistic" optimization that seeks out large negative fluctuations (negative is good, since we are minimizing). You can choose $\kappa$ based on quantiles of the normal distribution (see reference 2 below).
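As a toy illustration of this risk-sensitive target (the quadratic simulator and its $x$-dependent noise level below are invented for the example), one can estimate $\mathbb{E}[f(x)] + \kappa \sqrt{\operatorname{Var}[f(x)]}$ by Monte Carlo over replicate runs and minimize it on a grid:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_f(x, n_rep=200):
    """Hypothetical noisy objective whose noise level grows with |x|,
    so that risk sensitivity actually changes the answer."""
    scale = 0.1 + 0.5 * abs(x)
    return (x - 2.0) ** 2 + rng.normal(scale=scale, size=n_rep)

def risk_sensitive(x, kappa=1.0, n_rep=200):
    """Monte Carlo estimate of E[f(x)] + kappa * sqrt(Var[f(x)])."""
    samples = noisy_f(x, n_rep)
    return samples.mean() + kappa * samples.std(ddof=1)

# Brute-force grid search; kappa > 0 penalizes variability, kappa < 0 rewards
# the chance of unusually low values.
grid = np.linspace(0.0, 4.0, 81)
x_opt = min(grid, key=lambda x: risk_sensitive(x, kappa=1.0))
print(x_opt)
```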

In general, Bayesian optimization (BO, which is related to Gaussian processes and kriging) deals with costly and sometimes noisy function evaluations; although most of the focus of the literature has been on the former part. You can find reviews for Bayesian optimization at this question.

Several people have applied BO to noisy functions. As an introduction to the topic, David Ginsbourger gave a nice talk titled "Variations on the Expected Improvement" at the Workshop on Gaussian Processes for Global Optimization (Sheffield, Sept 17, 2015). You can find his talk here, and all the talks are available on this page (I also recommend all the other talks as an excellent general introduction to BO.)
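To give a flavour of the mechanics (this is a from-scratch sketch with a made-up objective and hand-picked settings, not the DiceOptim or tgp implementations referenced below), a noisy BO loop can be written with a GP that includes a white-noise term, and an expected-improvement criterion evaluated against the GP mean rather than the noisy best observation:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(2)

def noisy_f(x):
    """Hypothetical noisy objective to be minimized."""
    return np.sin(3 * x) + 0.1 * x ** 2 + rng.normal(scale=0.2)

X = rng.uniform(-3, 3, size=5).reshape(-1, 1)           # initial design
y = np.array([noisy_f(xi[0]) for xi in X])
kernel = Matern(length_scale=1.0, nu=2.5) + WhiteKernel(noise_level=0.04)
grid = np.linspace(-3, 3, 400).reshape(-1, 1)

for _ in range(25):
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    best = mu.min()                                      # plug-in incumbent (GP mean),
                                                         # more robust than min(y) under noise
    z = (best - mu) / np.maximum(sd, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)    # expected improvement
    x_next = grid[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, noisy_f(x_next[0]))

gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
print("estimated minimizer:", grid[np.argmin(gp.predict(grid))][0])
```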

As references, I would start with the work done by Ginsbourger and colleagues, and Gramacy and colleagues:

  1. Picheny, V. and Ginsbourger, D., 2014. "Noisy kriging-based optimization methods: a unified implementation within the DiceOptim package". Computational Statistics & Data Analysis, 71, pp.1035-1053. (link)

  2. Picheny, V., Ginsbourger, D., Richet, Y. and Caplin, G., 2013. "Quantile-based optimization of noisy computer experiments with tunable precision". Technometrics, 55(1), pp.2-13. (link)

  3. Gramacy, R.B. and Lee, H.K., 2012. "Bayesian treed Gaussian process models with an application to computer modeling". Journal of the American Statistical Association. (link)

  4. Gramacy, R.B. and Apley, D.W., 2015. "Local Gaussian process approximation for large computer experiments". Journal of Computational and Graphical Statistics, 24(2), pp.561-578. (link)

Both Ginsbourger and Gramacy have R packages that implement their BO methods, respectively DiceOptim and tgp.

lacerbi
  • Where is $k$ in your answer, or do you mean $\kappa$? – RustyStatistician Apr 09 '16 at 18:11
  • One more algorithm, which I have not used* but wins in the amusing name department, is [SNOBFIT](http://www.mat.univie.ac.at/~neum/software/snobfit/). (* The author *is* notable in the optimization community however, and the software did OK on a [deterministic benchmark](https://scholar.google.com/scholar?cluster=13996631775177561404), so the recommendation isn't *just* based on the cool name!) – GeoMatt22 Sep 17 '16 at 04:05
4

Let's say we're in a discrete probability space so that $f(x) \in \mathcal{R}^n$. Intuitively, you need some function $U: \mathcal{R}^n \rightarrow \mathcal{R}$ so you can optimize $U(f(x))$. You can only optimize a single objective!

Optimizing a single objective function may sound quite constraining, but it is not! Rather, a single objective can represent incredibly diverse preferences you may have over what is a better or worse solution.

Skipping ahead, a simple place to start may be choosing a random variable $\lambda$ then solving:

$$ \begin{array}{ll} \mbox{minimize (over $x$)} & E\left[\lambda f(x) \right] \\ \mbox{subject to} & x \in X \end{array} $$ This is a simple linear re-weighting of $E[f(x)]$. Anyway, here's an argument for why collapsing multiple objectives to a single objective is typically ok.

Basic setup:

  • You have a choice variable $x$ and a feasible set $X$.
  • Your choice of $x$ leads to a random outcome $\tilde{y} = f(x)$.
  • You have rational preferences $\prec$ over the random outcome. (Basically, you can say whether you prefer one random outcome $\tilde{y}$ to another.)

Your problem is to choose $x^*\in X$ such that:

$$ \nexists_{x \in X} \quad f(x^*) \prec f(x) $$ In English, you want to choose $x^*$ so that no feasible choice $x$ leads to an outcome preferred to $f(x^*)$.

Equivalence to maximizing utility (under certain technical conditions)

For technical simplicity, I'll say we're in a discrete probability space with $n$ outcomes so I can represent random outcome $\tilde{y}$ with a vector $\mathbf{y} \in \mathcal{R}^n$.

Under certain technical conditions (that aren't limiting in a practical sense), the above problem is equivalent to maximizing a utility function $U(\mathbf{y})$. (The utility function assigns more preferred outcomes a higher number.)

This logic would apply to any problem where your choice leads to multiple outcome variables.

$$ \begin{array}{ll} \mbox{maximize (over $x$)} & U(f(x)) \\ \mbox{subject to} & x \in X \end{array} $$

Giving more structure to utility function $U$: Expected Utility hypothesis:

If we're in a probabilistic setting and we accept the von Neumann-Morgenstern axioms, the overall utility function $U$ has to take a special form:

$$U(\mathbf{y}) = E[u(\tilde{y})] = \sum_i p_i u(y_i) $$ where $p_i$ is the probability of state $i$ and $u$ is a concave utility function. The curvature of $u$ measures risk aversion. Simply substituting this specialized form of $U$, you get:

$$ \begin{array}{ll} \mbox{maximize (over $x$)} & \sum_i p_i u(y_i) \\ \mbox{subject to} & x \in X \\ & \mathbf{y} = f(x) \end{array} $$

Observe that the simple case $u(y_i) = y_i$ corresponds to maximizing the expected value (i.e. no risk aversion).
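A tiny numerical sketch (the states, payoffs, and utility functions below are invented for illustration): with $u(y) = y$ you recover the expected value, while a concave $u$ trades the mean against the spread of outcomes.

```python
import numpy as np

# Two feasible choices x, each leading to a random outcome over three
# equally likely states (p_i = 1/3); payoffs are made up.
p = np.array([1/3, 1/3, 1/3])
outcomes = {
    "safe":  np.array([9.0, 10.0, 11.0]),   # low spread
    "risky": np.array([2.0, 10.0, 19.0]),   # higher mean, high spread
}

def expected_utility(y, u):
    """U(y) = sum_i p_i u(y_i)."""
    return float(np.sum(p * u(y)))

risk_neutral = lambda y: y           # u(y) = y: just the expected value
risk_averse = lambda y: np.log(y)    # concave u: penalizes spread

for name, y in outcomes.items():
    print(name, expected_utility(y, risk_neutral), expected_utility(y, risk_averse))
# Risk-neutral prefers "risky" (higher mean); the concave utility prefers "safe".
```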

Another approach: $\lambda$ weights

Another thing to do is: $$ \begin{array}{ll} \mbox{maximize (over $x$)} & \sum_i \lambda_i y_i \\ \mbox{subject to} & x \in X \\ & \mathbf{y} = f(x) \end{array} $$

Intuitively, you can choose weights $\lambda_i$ that are larger or smaller than the probability $p_i$ of a state occurring, and this captures the importance of a state.

The deeper justification of this approach is that, under certain technical conditions, there exist weights $\boldsymbol{\lambda}$ such that the above problem and the earlier problems (e.g. maximizing $U(f(x))$) have the same solution.
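Continuing the toy example from above (the weights are chosen by hand and purely hypothetical), re-weighting a bad state more heavily than its probability can flip which choice the single objective selects:

```python
import numpy as np

p = np.array([1/3, 1/3, 1/3])          # state probabilities
lam = np.array([0.6, 0.3, 0.1])        # hypothetical lambda weights:
                                       # the worst state is emphasized
outcomes = {
    "safe":  np.array([9.0, 10.0, 11.0]),
    "risky": np.array([2.0, 10.0, 19.0]),
}

for name, y in outcomes.items():
    print(name, "E[y] =", round(p @ y, 2), " lambda-weighted =", round(lam @ y, 2))
# "risky" wins on plain E[y], but "safe" wins once the bad state is up-weighted.
```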

Matthew Gunn
  • But in this setup not all utility functions lead to the same answer, correct? – RustyStatistician Sep 30 '16 at 17:07
  • And are there typical choices for utility functions? My problem is a stochastic computer simulator, which is actually a black-box simulator, so I have no information about the underlying mechanics; can I even possibly assign it a utility function? – RustyStatistician Sep 30 '16 at 17:12
  • You need to think through the logic of your problem, what constitutes a good result, and then find some objective function that assigns better outcomes a higher number. (Or equivalently, you can set this up as a minimization problem and assign worse outcomes a higher number, e.g. minimize some notion of squared error etc.) – Matthew Gunn Sep 30 '16 at 17:14
4

The current answers focus on the proper (mathematical) definition of a stochastic optimization target - I want to provide a somewhat more applied perspective.

This problem occurs frequently when fitting stochastic models, e.g. using informal or synthetic likelihoods. Reference (1) provides you with a list of options that can be used to define the distance between a stochastic model and the data.

After having defined your target in this way, the remaining issue is finding the optimum of the mean of a noisy target. There are two routes to go: a) optimization, and b) MCMC sampling. You were asking specifically about optimization, but I want to bring in the MCMCs because they are often better behaved for this task.

a) If you stay with optimization, you need to make sure that you don't get stuck and that the optimizer can deal with a stochastic target. Chapter 4 in the PhD thesis of Matteo Fasiolo gives some hints, see (2).

b) As we note in (1), MCMCs are generally more robust against a stochastic target - under mild conditions regarding the distribution of the noise, the MCMC will average the noise away, and the sampled target will be indistinguishable from a non-noisy target with the mean of the noisy target. However, MCMCs too can get stuck when encountering an evaluation that is particularly good. What you MUST NOT do now is follow the "obvious" idea of simply recalculating both the current and the proposed value in each MCMC iteration. The keyword to look up here is "pseudo-marginal"; see also here and here.
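To make the warning concrete, below is a minimal pseudo-marginal-style Metropolis-Hastings sketch (the Gaussian toy simulator, the synthetic-likelihood estimate, and all tuning constants are invented for illustration; exact pseudo-marginal validity additionally requires an unbiased likelihood estimator). The crucial point is that the noisy value stored for the current state is reused, not recalculated, in every iteration:

```python
import numpy as np

rng = np.random.default_rng(3)
data = 1.5                               # "observed" summary statistic

def loglik_hat(theta, n_rep=50):
    """Noisy log-likelihood estimate from a stochastic simulator
    (synthetic-likelihood style: fit a Gaussian to simulator output)."""
    sims = theta + rng.normal(scale=1.0, size=n_rep)   # simulator runs
    mu, sd = sims.mean(), sims.std(ddof=1)
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * (data - mu) ** 2 / sd**2

def log_prior(theta):
    return -0.5 * theta**2               # standard normal prior, for illustration

theta, ll = 0.0, loglik_hat(0.0)
chain = []
for _ in range(5000):
    prop = theta + rng.normal(scale=0.5)
    ll_prop = loglik_hat(prop)
    # Pseudo-marginal rule: compare a fresh noisy value for the proposal with
    # the STORED noisy value for the current state -- the current value is
    # not recomputed at each iteration.
    if np.log(rng.uniform()) < ll_prop + log_prior(prop) - ll - log_prior(theta):
        theta, ll = prop, ll_prop
    chain.append(theta)

print("posterior mean estimate:", np.mean(chain[1000:]))
```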

1) Hartig, F.; Calabrese, J. M.; Reineking, B.; Wiegand, T. & Huth, A. (2011) Statistical inference for stochastic simulation models - theory and application. Ecol. Lett., 14, 816-827.

2) Fasiolo, M. (2016) Statistical Methods for Complex Population Dynamics. University of Bath

Florian Hartig