2

I would like to fit a Gaussian Process Regressor from scikit-learn to predict values of an objective function, in order to explore the behaviour of the inputs and the interplay between them.

Data points are expensive to obtain (hours of computation on hundreds of cores), so I would like to use an efficient design of experiments (DoE) to collect the data for fitting the regression model.

Which DoE method should I use? How many data points are needed for a given number of inputs?

kjetil b halvorsen
voltej
  • Can you tell us somewhat more? Like the number and nature of input variables (continuous or categorical/how many levels?). To what use will the surrogate model be put? Some similar questions: https://stats.stackexchange.com/questions/167848/response-surface-methodology-rsm-for-a-mathematical-model https://stats.stackexchange.com/questions/99473/how-do-i-choose-which-design-of-experiments-method-to-use https://stats.stackexchange.com/questions/335652/space-filling-vs-d-a-i-etc-optimal-experiment-design – kjetil b halvorsen Apr 05 '18 at 13:21
  • Also see: https://en.wikipedia.org/wiki/Surrogate_model – kjetil b halvorsen Apr 05 '18 at 13:31
  • There would be 2-7 input variables, all of them continuous. – voltej Apr 06 '18 at 10:35
  • Thanks for the links, but they are not answering the question. – voltej Apr 06 '18 at 10:37
  • I'm looking for an answer like: "Based on this theory/extensive study, use this number of points for this number of variables, then check the stats of the model and add this number of infill points." – voltej Apr 06 '18 at 10:39
  • Well, I understand that, but you have not given enough information for such an answer! Is the computer simulation deterministic or stochastic? If stochastic, what is the variance? If deterministic, smooth or wiggly? Should the surrogate model be used for system optimization, or some other use? ... – kjetil b halvorsen Apr 06 '18 at 11:31
  • Since you are doing computer experiments, you do not need to decide on the number of runs beforehand, so you can do some runs, analyze, then decide some further infill runs, ... I would look at https://www.amazon.co.uk/Evolutionary-Operation-Statistical-Improvement-Classics/dp/0471255513/ref=sr_1_1?s=books&ie=UTF8&qid=1523015065&sr=1-1&keywords=evolutionary+operation – kjetil b halvorsen Apr 06 '18 at 11:47
  • I see. The simulation is deterministic and smooth. We want the model to provide quick feedback to engineers who propose small design changes to a geometry. – voltej Apr 06 '18 at 11:52
  • Actually, I want to provide a quick feedback, so I'm looking for a process to make a reliable regression model in advance. – voltej Apr 06 '18 at 11:55
  • Is your interest merely to "explore" the behavior, or do you specifically want to **optimize** the function? If your focus is optimization, there are a number of tools developed for expensive objective functions. https://stats.stackexchange.com/questions/193306/optimization-when-cost-function-slow-to-evaluate/193310#193310 – Sycorax Apr 06 '18 at 15:38

2 Answers

1

Let me try, with the caveat that I have little practical experience with such experiments. Let's say you have 5 continuous input variables. First, decide on the range of interest for each variable. Then, for each variable, take the minimum and maximum of the range as the first input levels. This might cause problems if the function behaves atypically near the edges, so you may want to start with a somewhat narrower range and rely on continuity to extend beyond it later (at least stay away from singularities at first). Run a fractional factorial experiment, maybe a $2^{5-2}$ design, which has eight runs, plus a centerpoint (a centerpoint is a cheap way to get some information on curvature). That gives 9 points for a first model. With so few points you can only fit a linear (main-effects) model; fitting interactions will need more points, unless you have strong prior information to use in a Bayesian model. If time permits, extend this to a $2^{5-1}$ design (16 runs, still with a centerpoint), which will also give information on interactions.
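
As a rough illustration of the first design described above, here is a minimal sketch that builds a $2^{5-2}$ fractional factorial plus a centerpoint. The generators D = AB and E = AC, the function name, and the example ranges are assumptions made for illustration; they are not prescribed by the answer.

```python
from itertools import product

def fractional_factorial_2_5_2(ranges):
    """Return the 8 runs of a 2^(5-2) design plus one centerpoint.

    ranges: five (low, high) tuples, one per input variable.
    The generators D = A*B and E = A*C are an assumed choice.
    """
    coded = []
    for a, b, c in product((-1, 1), repeat=3):  # full 2^3 design in A, B, C
        coded.append((a, b, c, a * b, a * c))   # derived columns D and E
    coded.append((0, 0, 0, 0, 0))               # centerpoint

    runs = []
    for row in coded:
        # Map coded levels -1 / 0 / +1 to low / middle / high of each range.
        runs.append(tuple(
            lo + (level + 1) / 2 * (hi - lo)
            for level, (lo, hi) in zip(row, ranges)
        ))
    return coded, runs

# Hypothetical ranges of interest for the five inputs.
ranges = [(0.0, 1.0), (10.0, 20.0), (-5.0, 5.0), (100.0, 300.0), (0.1, 0.9)]
coded, runs = fractional_factorial_2_5_2(ranges)
for run in runs:
    print(run)
```

Each printed tuple is one simulation to run. The eight factorial rows are balanced and mutually orthogonal in the coded units, which is what makes the main-effects fit cheap.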

Use this first model to help find points of interest for further runs.

I would at least start along some such lines, but maybe first read some papers, such as Use of Kriging Models to Approximate Deterministic Computer Models (Kriging is just another name for Gaussian process models). The books Engineering Design via Surrogate Modelling and Evolutionary Operation: Statistical Method for Process Improvement seem directly relevant.

There are also some ideas and references at Function Approximation vs. Regression. In some cases you could look at ideas from optimal experimental design; see this presentation.

kjetil b halvorsen
1

The answer here depends on whether you want to explore the space by collecting a bunch of data points at once, or whether you'd like to quickly converge to an optimal point in the space.

If you want to collect a bunch of points at once that are informative about the rest of the space, then you want to maximize entropy of the points you collect. If you know what covariance function you're using in your GP model, then you can calculate the entropy from the training covariance before you actually make the measurements at those points. The entropy of a multivariate Gaussian is:

$${\frac {1}{2}}\ln \operatorname {det} \left(2\pi \mathrm {e} {\boldsymbol {\Sigma }}\right)$$

where $\Sigma$ is the covariance.
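
To make the entropy criterion concrete, here is a small sketch that scores two candidate designs before any simulation is run. It uses $\det(2\pi e\,\Sigma) = (2\pi e)^n \det\Sigma$ for $n$ points. The RBF kernel, its length scale of 0.3, the jitter term, and both example designs are assumptions chosen only for illustration.

```python
import numpy as np

def rbf_kernel(X, length_scale=0.3, jitter=1e-8):
    """Squared-exponential covariance between all rows of X (assumed kernel)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2) + jitter * np.eye(len(X))

def gaussian_entropy(cov):
    """0.5 * ln det(2*pi*e*cov), computed via a numerically stable log-determinant."""
    n = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)

# Two hypothetical 4-point designs in a 2-D unit square.
clustered = np.array([[0.50, 0.50], [0.52, 0.50], [0.50, 0.52], [0.52, 0.52]])
spread = np.array([[0.1, 0.1], [0.9, 0.1], [0.1, 0.9], [0.9, 0.9]])

h_clustered = gaussian_entropy(rbf_kernel(clustered))
h_spread = gaussian_entropy(rbf_kernel(spread))
print(h_clustered, h_spread)
```

The spread-out design should score the higher entropy, because nearly coincident points are highly correlated and drive the determinant toward zero. In practice one would search over candidate designs for the one with the largest value.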

On the other hand, if you want to find an optimal point, then you want a Bayesian black box optimization algorithm. One simple example is the upper confidence bound algorithm:

  1. Collect a data point
  2. Train a GP regression model
  3. Find the point in your search space that maximizes the predicted mean + standard deviation.
  4. Measure that point.
  5. Repeat 2-4 until some stopping condition is met.
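
The loop above might be sketched with scikit-learn roughly as follows. The `objective` function is a hypothetical cheap stand-in for the expensive simulation. The fixed RBF kernel (`optimizer=None`), the grid of candidate points, and the fixed budget of 12 iterations as the stopping condition are all assumptions made to keep the example small and deterministic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):
    # Hypothetical stand-in for the expensive simulation; maximum at x = 0.7.
    return -(x - 0.7) ** 2

# Search space: a 1-D grid of candidate inputs (an assumption for simplicity).
candidates = np.linspace(0.0, 1.0, 101).reshape(-1, 1)

X = np.array([[0.0]])            # step 1: collect a first data point
y = objective(X).ravel()

for _ in range(12):              # stopping condition: fixed evaluation budget
    # Step 2: train a GP regression model (kernel hyperparameters held fixed).
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2),
                                  alpha=1e-6, optimizer=None)
    gp.fit(X, y)
    # Step 3: find the candidate maximizing predicted mean + standard deviation.
    mean, std = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(mean + std)]
    # Step 4: measure that point and add it to the training data.
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next))

best_x = X[np.argmax(y), 0]
print(best_x)
```

Early iterations are dominated by the standard-deviation term and spread points across the range. Once the uncertainty has shrunk, the mean term takes over and the samples concentrate near the maximum.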

I believe one of the other commenters linked you to more resources on Bayesian black box optimization. scikit-optimize implements some of these with Gaussian processes.

Kevin Yang
  • Thanks. The first part actually answers my question but I needed to dig deeper into the subject in order to understand it. Will probably post my own answer in order to make it more clear. – voltej Apr 13 '18 at 13:31