Given a multivariate probability density, for example a 2-dimensional random vector with probability density function $\rho\left(x,y\right)$, the conditional probability density function of $x$ given $y=y^{*}$ is obtained by evaluating the joint density at $y=y^{*}$. Fixing one random variable to a given value changes the probability distribution of the remaining random variables. A re-normalization constant is also needed, since fixing $y$ to an observed value is not the same as marginalizing the distribution, which would account for every possible $y$.
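Concretely, for the 2-dimensional example this reads
$$
\rho\left(x \mid y=y^{*}\right) = \frac{\rho\left(x,y^{*}\right)}{\int \rho\left(x',y^{*}\right)\,\mathrm{d}x'},
$$
where the denominator is precisely the re-normalization constant.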
Gaussian processes follow the same procedure. A Gaussian process is a multivariate Gaussian probability distribution; it represents a prior once a kernel is provided but no restrictions from observations have been imposed yet. "Prediction" then amounts to conditioning on previous observations, either fixed to exact values or known only through noisy measurements.
I will provide an approximate derivation to help understand the result in your citation. With the change of notation $A=K\left(A,A\right)$, $C=K\left(\hat{A},A\right)$, $S=K\left(\hat{A},\hat{A}\right)$, your Gaussian process is equivalent to
$$
\mathcal{N}\left(\begin{bmatrix}
\mu_{x}\\
\mu_{y}
\end{bmatrix}, \begin{bmatrix}
A & C\\
C^{T} & S
\end{bmatrix}\right) =
\mathcal{N}\left(\begin{bmatrix}
\mu_{x}\\
\mu_{y}
\end{bmatrix}, \begin{bmatrix}
\tilde{A} & \tilde{C}\\
\tilde{C}^{T} & \tilde{S}
\end{bmatrix}^{-1}\right)
$$
Such a representation of the covariance matrix (as the inverse of a precision matrix with blocks $\tilde{A}$, $\tilde{C}$, $\tilde{S}$) is possible because, for the Gaussian density to be well defined, the covariance matrix must be positive definite and hence invertible. I will assume that you are familiar with the exponential functional dependence of this particular distribution; let us do algebra with its exponent (omitting the $-1/2$ factor), which is where the main point of that result is found, leaving the normalization constant aside for the time being.
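For reference, the exponential functional dependence mentioned above is
$$
p\left(x,y\right) \propto \exp\left(-\frac{1}{2}\begin{bmatrix}
x-\mu_{x}\\
y-\mu_{y}
\end{bmatrix}^{T}\begin{bmatrix}
\tilde{A} & \tilde{C}\\
\tilde{C}^{T} & \tilde{S}
\end{bmatrix}\begin{bmatrix}
x-\mu_{x}\\
y-\mu_{y}
\end{bmatrix}\right)
$$
(writing $\mu - x$ instead of $x - \mu$ inside the quadratic form below makes no difference, since the sign change cancels).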
$$
\left(\mu_{x}^{T}-x^{T},\ \mu_{y}^{T}-y^{T}\right)\begin{pmatrix}
\tilde{A} & \tilde{C}\\
\tilde{C}^{T} & \tilde{S}
\end{pmatrix}\begin{pmatrix}
\mu_{x}-x\\
\mu_{y}-y
\end{pmatrix}
$$
At this point, $y$ is no longer a random variable; it is fixed to its observed value. Let's expand the exponent to see how the distribution is modified,
$$
\left(\mu_{x}^{T}-x^{T}\right) \tilde{A} \left(\mu_{x}-x\right) +
\left(\mu_{x}^{T}-x^{T}\right) \tilde{C} \left(\mu_{y}-y\right) +
\left(\mu_{y}^{T}-y^{T}\right) \tilde{C}^{T} \left(\mu_{x}-x\right) +
\left(\mu_{y}^{T}-y^{T}\right) \tilde{S} \left(\mu_{y}-y\right)
$$
From this expansion, the term involving only $y$-quantities, $\left(\mu_{y}^{T}-y^{T}\right) \tilde{S} \left(\mu_{y}-y\right)$, factors out of the exponential and remains a constant (keeping in mind that $y$ is not a random variable anymore). The remaining terms can be rearranged to fit the form of a new Gaussian exponent. You already know that the conditional distribution is a Gaussian distribution with a different covariance matrix $\Sigma'$:
$$
\left(\mu_{x}^{T}-x^{T}\right) \tilde{A} \left(\mu_{x}-x\right) +
\left(\mu_{x}^{T}-x^{T}\right) \tilde{C} \left(\mu_{y}-y\right) +
\left(\mu_{y}^{T}-y^{T}\right) \tilde{C}^{T} \left(\mu_{x}-x\right) = \\
\left(\mu_{x}^{T}-x^{T}\right) \tilde{A} \left(\mu_{x}-x\right)
-\left(\mu_{x}^{T}-x^{T}\right) \tilde{A}\tilde{A}^{-1}\tilde{C} \left(y-\mu_{y}\right) -\left(y^{T}-\mu_{y}^{T}\right) \tilde{C}^{T}\tilde{A}^{-1}\tilde{A} \left(\mu_{x}-x\right)\\
$$
The next step is re-grouping common factors, i.e. completing the square in $x$. Doing so adds an extra term $\left(y^{T}-\mu_{y}^{T}\right)\tilde{C}^{T}\tilde{A}^{-1}\tilde{C}\left(y-\mu_{y}\right)$ that does not depend on $x$, which is why the relations below hold only up to additive constants (hence the "$\approx$"); that constant is absorbed into the normalization as well. The result is the exponent of a new multivariate Gaussian distribution,
$$
\left(\mu_{x}^{T}-x^{T}\right) \tilde{A} \left(\mu_{x}-x\right) +
\left(\mu_{x}^{T}-x^{T}\right) \tilde{C} \left(\mu_{y}-y\right) +
\left(\mu_{y}^{T}-y^{T}\right) \tilde{C}^{T} \left(\mu_{x}-x\right) \approx \\
\left(\mu_{x}^{T} -\left(y^{T}-\mu_{y}^{T}\right) \tilde{C}^{T}\tilde{A}^{-1}-x^{T}\right) \tilde{A} \left(\mu_{x}-\tilde{A}^{-1}\tilde{C} \left(y-\mu_{y}\right)-x\right) \\
\approx -2\log\left( \mathcal{N}\left(x;\ \mu_{x}-\tilde{A}^{-1}\tilde{C} \left(y-\mu_{y} \right),\ \tilde{A}^{-1}\right) \right)
$$
Again, I want to stress that the normalization factor has to be recalculated. In fact, by definition of conditional probability, $P\left(A|B\right) = \frac{P\left(A,B\right)}{P\left(B\right)}$, where $P\left(B\right)$ is a fixed value once $B$ is observed.
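In our case the denominator is the marginal density of $y$ evaluated at the observed value, which for a joint Gaussian is itself Gaussian,
$$
p\left(x \mid y\right) = \frac{p\left(x,y\right)}{p\left(y\right)}, \qquad p\left(y\right)=\mathcal{N}\left(y;\ \mu_{y},\ S\right),
$$
so all the constant factors dropped along the derivation are exactly what assembles into $p\left(y\right)$.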
The last step is connecting the matrices $A$, $C$, $S$ with $\tilde{A}$ and $\tilde{C}$. For that I will redirect you to equations A.11 - A.13 of the book "Gaussian Processes for Machine Learning" (http://www.gaussianprocess.org/gpml/), which give the inverse of a partitioned matrix as I used at the beginning of my answer; another reference for the result I summarized is given in its Appendix A.2. If you are truly interested in Gaussian processes, that book is highly recommended.
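Spelled out, for completeness (this is the standard partitioned-inverse result, not taken from your citation), those identities give
$$
\tilde{A} = \left(A - C S^{-1} C^{T}\right)^{-1}, \qquad \tilde{C} = -\tilde{A}\, C S^{-1},
$$
so the conditional obtained above becomes the familiar form
$$
\mathcal{N}\left(\mu_{x} + C S^{-1}\left(y-\mu_{y}\right),\ A - C S^{-1} C^{T}\right).
$$
Finally, here is a minimal numerical sanity check of the whole argument, a sketch that assumes nothing beyond numpy and an arbitrary random positive-definite joint covariance (it is not tied to any particular kernel): it conditions a joint Gaussian once through the precision blocks $\tilde{A}$, $\tilde{C}$ derived above and once through the covariance blocks $A$, $C$, $S$, and checks that both routes agree.
```python
import numpy as np

rng = np.random.default_rng(0)

# Small arbitrary example: x has dimension 2, y has dimension 3.
nx, ny = 2, 3
n = nx + ny
M = rng.standard_normal((n, n))
Sigma = M @ M.T + n * np.eye(n)      # random symmetric positive-definite joint covariance
mu = rng.standard_normal(n)
mu_x, mu_y = mu[:nx], mu[nx:]

# Covariance blocks [[A, C], [C^T, S]] and precision blocks [[A~, C~], [C~^T, S~]]
A, C, S = Sigma[:nx, :nx], Sigma[:nx, nx:], Sigma[nx:, nx:]
Prec = np.linalg.inv(Sigma)
A_t, C_t = Prec[:nx, :nx], Prec[:nx, nx:]

y = rng.standard_normal(ny)          # observed value we condition on

# Conditional mean/covariance via the precision-block form derived in this answer
mean_prec = mu_x - np.linalg.inv(A_t) @ C_t @ (y - mu_y)
cov_prec = np.linalg.inv(A_t)

# Conditional mean/covariance via the covariance-block form (GPML, A.11 - A.13)
mean_cov = mu_x + C @ np.linalg.solve(S, y - mu_y)
cov_cov = A - C @ np.linalg.solve(S, C.T)

assert np.allclose(mean_prec, mean_cov)
assert np.allclose(cov_prec, cov_cov)
```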