From what I understand, the 'standard' approach to Bayesian Optimisation uses a Gaussian process for the prior (as opposed to more recent proposals like TPE or Bayesian Optimisation with random forests; please correct me if any of this is wrong).
However, I'm having a hard time finding out who first proposed this. My statistical background is very limited, so I find it difficult to judge to what extent old papers use essentially the same approach, especially when they use different terminology.
A couple of candidates are:
Kushner 1962. A Versatile Stochastic Model of a Function of Unknown and Time Varying Form. Journal of Mathematical Analysis and Applications.
Kushner 1964. A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise. Journal of Basic Engineering.
Močkus 1975. On Bayesian methods for seeking the extremum. Optimization Techniques IFIP Technical Conference.
Žilinskas 1978. On statistical models for multimodal optimization. Series Statistics.
O'Hagan 1978. Curve Fitting and Optimal Design for Prediction. Journal of the Royal Statistical Society. Series B (Methodological).
There are also earlier and later works by Močkus and Žilinskas, some of them in Russian.
In addition, there's Krige and the Kriging literature, which, as far as I can tell, were not concerned with optimisation?
Of these, only O'Hagan mentions Gaussian processes explicitly. Kushner and Žilinskas discuss Gaussian random variables, functions and fields, but I'm not sure whether those are the same thing as a Gaussian process. (Wikipedia says that Gaussian processes and one-dimensional Gaussian random fields are the same thing.)
My impression is that the earliest forms of Bayesian optimisation modelled the prior distribution with Wiener processes/Brownian motion. According to Wikipedia, Wiener processes are Gaussian processes, so does that mean the Wiener process is simply a stronger assumption that later proved unnecessary? Is the choice between Wiener processes and general Gaussian processes significant for the performance of Bayesian optimisation, or was the switch motivated more by practical/pragmatic reasons?
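To make sure I have the Wiener/Gaussian-process relationship right, here is a minimal sketch of how I understand it, assuming NumPy only. As far as I can tell, a Wiener process is the Gaussian process with zero mean and covariance k(s, t) = min(s, t), so a "general" Gaussian process just swaps in a different kernel. The helper names (`wiener_kernel`, `rbf_kernel`, `sample_gp`), the squared-exponential kernel as the "modern" choice, and the lengthscale and jitter values are my own illustrative assumptions, not taken from any of the papers above.

```python
import numpy as np

def wiener_kernel(s, t):
    """Covariance of standard Brownian motion: k(s, t) = min(s, t)."""
    return np.minimum(s[:, None], t[None, :])

def rbf_kernel(s, t, lengthscale=0.2):
    """Squared-exponential kernel (lengthscale is an arbitrary choice)."""
    d = s[:, None] - t[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp(kernel, t, n_samples, rng, jitter=1e-6):
    """Draw zero-mean GP sample paths at inputs t via a Cholesky factor."""
    K = kernel(t, t) + jitter * np.eye(len(t))  # jitter for numerical stability
    L = np.linalg.cholesky(K)
    return (L @ rng.standard_normal((len(t), n_samples))).T

rng = np.random.default_rng(0)
t = np.linspace(0.01, 1.0, 100)

# Brownian motion built the "classical" way: cumulative sums of
# independent Gaussian increments with variance equal to the time step.
dt = np.diff(np.concatenate(([0.0], t)))
increments = rng.standard_normal((20000, len(t))) * np.sqrt(dt)
paths = np.cumsum(increments, axis=1)

# Its empirical covariance should approach min(s, t), i.e. the GP kernel.
empirical_cov = np.cov(paths, rowvar=False)
max_err = np.max(np.abs(empirical_cov - wiener_kernel(t, t)))
print(f"max abs deviation from min(s, t): {max_err:.3f}")

# The same sampler gives rough Wiener paths or smooth paths,
# depending only on which kernel is plugged in.
wiener_samples = sample_gp(wiener_kernel, t, 5, rng)
smooth_samples = sample_gp(rbf_kernel, t, 5, rng)
```

If this picture is correct, the difference between the early Wiener-process models and a modern Gaussian-process prior would come down to the choice of covariance function, which is what prompts the question about performance.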
A more specific form of my question is: when I use Bayesian optimisation with Gaussian processes in a modern software package like GPyOpt or Emukit, what's the literature underpinning that?