
I am currently studying the textbook Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K. I. Williams. Chapter 1 Introduction says the following:

The specification of the prior is important, because it fixes the properties of the functions considered for inference. Above we briefly touched on the mean and pointwise variance of the functions. However, other characteristics can also be specified and manipulated. Note that the functions in Figure 1.1(a) are smooth and stationary (informally, stationarity means that the functions look similar at all $x$ locations). These are properties which are induced by the covariance function of the Gaussian process; many other covariance functions are possible.

I don't understand what is meant by the following:

These are properties which are induced by the covariance function of the Gaussian process; many other covariance functions are possible.

What is meant by this idea that a covariance function of a Gaussian process "induces" properties? And what is the connection to stationarity?

The Wikipedia article for covariance function doesn't seem to have any information on this.

And here is a related question (about stationarity) from the same textbook: Stationary function

The Pointer
    The book is rather hard to digest the first time through; I'd recommend also checking other resources & writing your own code to try to understand it. You can find some examples & further references in my notebook https://nbviewer.jupyter.org/github/twolodzko/LearningJulia/blob/master/Gaussian-Process.ipynb – Tim Dec 26 '20 at 08:57
  • @Tim Interesting. Thanks! – The Pointer Dec 26 '20 at 09:13
  • For the smoothness part it is related to: https://stats.stackexchange.com/questions/316297/ https://stats.stackexchange.com/questions/353695/ https://stats.stackexchange.com/questions/180823/ https://mathoverflow.net/questions/280188/ – Sextus Empiricus Dec 27 '20 at 14:11

2 Answers


The covariance function of a Gaussian process is not just any covariance, but a quite specific one. It is defined in terms of a kernel, which can be thought of as measuring the distance (or similarity) between input points. This definition forces ("induces") the function outputs to be similar for similar inputs: points that are close to each other are forced by the covariance function to be more strongly correlated.
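
To make this concrete, here is a minimal sketch (an illustration, assuming the squared-exponential kernel with unit lengthscale, which is only one of many possible choices) of how the covariance between two function values decays with the distance between their inputs:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel: the covariance depends only on the distance |x1 - x2|."""
    return np.exp(-0.5 * (x1 - x2) ** 2 / lengthscale ** 2)

# Nearby inputs -> covariance close to 1 -> function values forced to be similar.
print(rbf_kernel(0.0, 0.1))   # ~0.995
# Far-apart inputs -> covariance close to 0 -> function values nearly uncorrelated.
print(rbf_kernel(0.0, 3.0))   # ~0.011
```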

Tim

A GP takes a function-space view, meaning it tries to find the function that maps your features to your target (more of an approximation than an exact mapping). The difference from models like linear regression is that instead of imposing a structure on the function (in linear regression, the imposed structure is that the function is linear in its parameters), a GP defines a distribution over a function space and looks for the best-fitting function, rather than the best-fitting parameters, for the problem at hand.

A GP also assumes that the best-fitting function is random, with outputs that follow a joint normal distribution. As a side note, this is not the only difference between the two models; they are quite different in general, but the point above is the difference relevant to the question.

Since a normal distribution is fully characterized by its mean vector and covariance matrix, and since both are computed from the kernel function (the covariance function), we get the statement that the "covariance function of a Gaussian process induces properties": depending on the specific kernel function we choose, the properties of the distribution, and thus the properties of the GP, change.
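
For example, here is a small sketch (assuming two standard kernels as illustrations: the squared-exponential and the absolute-exponential) that draws samples from the GP prior under each; changing only the covariance function changes how smooth the sampled functions are:

```python
import numpy as np

x = np.linspace(0, 5, 200)[:, None]          # input locations
d = np.abs(x - x.T)                          # pairwise distances

K_smooth = np.exp(-0.5 * d ** 2)             # squared-exponential: very smooth samples
K_rough  = np.exp(-d)                        # absolute-exponential: continuous but jagged samples

rng = np.random.default_rng(0)
jitter = 1e-6 * np.eye(len(x))               # small diagonal term for numerical stability
f_smooth = rng.multivariate_normal(np.zeros(len(x)), K_smooth + jitter)
f_rough  = rng.multivariate_normal(np.zeros(len(x)), K_rough + jitter)

# Mean squared first difference: the "rough" kernel's samples wiggle much more locally.
print(np.mean(np.diff(f_smooth) ** 2), np.mean(np.diff(f_rough) ** 2))
```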

In a nutshell, stationarity can be defined as follows: if the properties of the distribution do not change as you shift across the input space (e.g., over time), it is stationary; if they do, it is not. Note that this is not a rigorous definition. For a GP, a stationary covariance function is one that depends only on the difference $x - x'$ between inputs, not on their absolute locations.

The connection to stationarity comes from whether the kernel function itself is stationary. If the problem at hand is not stationary, a stationary kernel function will be a poor fit, so the choice of kernel function depends on the problem.
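
A quick way to see the distinction, sketched here with two assumed examples (the squared-exponential kernel, which is stationary, and a simple dot-product kernel, which is not): shifting both inputs by the same amount leaves a stationary kernel's value unchanged, but changes a non-stationary one's.

```python
import numpy as np

def k_rbf(x1, x2):
    """Stationary: depends only on the difference x1 - x2."""
    return np.exp(-0.5 * (x1 - x2) ** 2)

def k_linear(x1, x2):
    """Non-stationary: depends on the locations themselves, not just their difference."""
    return x1 * x2

# Shifting both inputs by +10 leaves the stationary kernel unchanged...
print(k_rbf(1.0, 2.0), k_rbf(11.0, 12.0))        # equal values
# ...but changes the non-stationary one.
print(k_linear(1.0, 2.0), k_linear(11.0, 12.0))  # 2.0 vs 132.0
```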

Edit:

Please refer to Section 2.2 of Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K. I. Williams.

The expression below is Equation 2.19 in the book:

$\mathbf{f_*}|X_*,X,\mathbf{f} \sim N(K(X_*, X)K(X,X)^{-1}\mathbf{f}, K(X_*, X_*)-K(X_*, X)K(X, X)^{-1}K(X, X_*))$

Where:

$X$: Training data input (features)

$X_*$: Test data input (features)

$\mathbf{f}$: Training data output (target)

$\mathbf{f_*}$: Test data output (target), the value we are trying to predict.

$K(\cdot, \cdot)$: Kernel function

The mean vector of this distribution gives your point predictions, and the covariance matrix gives your uncertainty estimates. As you can see from the formulation, both the mean vector and the covariance matrix are functions of the kernel, which is what was meant above by "the mean vector and covariance matrix are estimated using the kernel (covariance) function".
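
For illustration, Equation 2.19 can be translated almost line-for-line into code. The sketch below assumes a noise-free setting, a squared-exponential kernel, and toy data (none of which is specified by the equation itself); variable names mirror the symbols above:

```python
import numpy as np

def kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix K(A, B) for 1-D inputs."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * d ** 2 / lengthscale ** 2)

# Toy training data (X, f) and test inputs X_*.
X  = np.array([-2.0, 0.0, 1.5])
f  = np.sin(X)
Xs = np.linspace(-3, 3, 7)

K    = kernel(X, X)            # K(X, X)
Ks   = kernel(Xs, X)           # K(X_*, X)
Kss  = kernel(Xs, Xs)          # K(X_*, X_*)
Kinv = np.linalg.inv(K + 1e-8 * np.eye(len(X)))   # small jitter for numerical stability

mean = Ks @ Kinv @ f           # K(X_*, X) K(X, X)^{-1} f
cov  = Kss - Ks @ Kinv @ Ks.T  # K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)

print(mean)            # point predictions at the test inputs
print(np.diag(cov))    # predictive variances; near zero where a test input coincides with a training input
```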

I believe if you keep reading the book, everything will become more apparent in each chapter.

CheeseBurger