I'm getting lost in the inference step of Gaussian process regression:
We start with a dataset $X \in \mathbb{R}^{n \times d}$ with corresponding observations $f = f(X)$ and some test data points $X_* \in \mathbb{R}^{n_* \times d}$ for which we want to infer function values $f_* = f(X_*)$.
The GP defines the joint distribution $p(f, f_* \mid X, X_*)$:
$$\begin{bmatrix}f\\ f_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\mu \\ \mu_*\end{bmatrix}, \, \begin{bmatrix}K & K_*\\ K_*^T & K_{**}\end{bmatrix}\right)$$
with ${\mu} = m(X)$, $\mu_* = m(X_*)$, $K = K(X,X) \, , \, K_* = K(X, X_*) \, , \, K_{**} = K(X_*, X_*)$.
Then we apply the rules of multivariate normal conditioning to obtain
$$f_* \mid f, X, X_* \sim \mathcal{N}\left(\mu_* + K_*^T K^{-1}(f-\mu) \, , \, K_{**} - K_*^T K^{-1} K_*\right)$$
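For concreteness, here is a minimal NumPy sketch of this conditioning step as I understand it (my own illustrative choices, not part of the question: a zero mean function $m(x) = 0$, a squared-exponential kernel, and $f(x) = \sin(x)$ as the underlying function):

```python
import numpy as np

def kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel K(A, B) between two sets of points."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(10, 1))        # training inputs
f = np.sin(X).ravel()                       # observed values f = f(X)
X_star = np.linspace(-3, 3, 50)[:, None]    # test inputs

K = kernel(X, X)                       # K  = K(X, X)
K_star = kernel(X, X_star)             # K_* = K(X, X_*)
K_starstar = kernel(X_star, X_star)    # K_** = K(X_*, X_*)

# Posterior mean and covariance from the conditioning formula above,
# using a Cholesky solve instead of explicitly inverting K (with a
# small jitter on the diagonal for numerical stability).
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(X)))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))  # K^{-1} (f - mu), with mu = 0
mu_star = K_star.T @ alpha                           # mu_* + K_*^T K^{-1} (f - mu)
v = np.linalg.solve(L, K_star)
cov_star = K_starstar - v.T @ v                      # K_** - K_*^T K^{-1} K_*
```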
Up to here everything makes sense, since any finite subset of the (possibly uncountable) family of Gaussian random variables defined by the GP is still jointly Gaussian, but:
Is all the information on the right-hand side known to us, i.e., does it derive directly from the test set $X_*$, for which the evaluations $f(X_*)$ are unknown?
What I don't understand, for example, is that by definition:
$$\mu_* = m(X_*) = \mathbb{E}[f(X_*)]$$
and while this makes sense for known tuples $(x, f(x))$, we don't know the values $f(X_*)$! What am I missing here?
Many thanks,
James