9

Actually, I thought a Gaussian process is a kind of Bayesian method, since I have read many tutorials in which GPs are presented in a Bayesian context; for example, in this tutorial, just pay attention to page 10.

Suppose the GP prior is $$\pmatrix{h\\ h^*} \sim N\left(0,\pmatrix{K(X,X)&K(X,X^*)\\ K(X^*,X)&K(X^*,X^*)}\right),$$ where $(h,X)$ is for the observed training data and $(h^*,X^*)$ for the test data to be predicted. The actually observed noisy output is $$Y=h+\epsilon,$$ where $\epsilon$ is the noise, $$\epsilon\sim N(0,\sigma^2I).$$ Now, as shown in the tutorial, we have $$\pmatrix{Y\\ Y^*}=\pmatrix{h\\ h^*}+\pmatrix{\epsilon\\ \epsilon^*}\sim N\left(0,\pmatrix{K(X,X)+\sigma^2I&K(X,X^*)\\ K(X^*,X)&K(X^*,X^*)+\sigma^2I}\right),$$ and finally, by conditioning on $Y$, we obtain $p(Y^*|Y)$, which is called the predictive distribution in some books or tutorials, but the posterior in others.
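
For concreteness (this is just the standard Gaussian conditioning identity applied to the joint above, so take it as a sketch of the algebra rather than anything from the tutorial): writing $K=K(X,X)$, $K_*=K(X,X^*)$ and $K_{**}=K(X^*,X^*)$, conditioning on $Y$ gives $$Y^*\mid Y \sim N\left(K_*^\top\left(K+\sigma^2I\right)^{-1}Y,\; K_{**}+\sigma^2I-K_*^\top\left(K+\sigma^2I\right)^{-1}K_*\right).$$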

QUESTION

  1. According to many tutorials, the predictive distribution $p(Y^*|Y)$ is derived by conditioning on $Y$. If this is correct, I don't understand why GP regression is Bayesian. Nothing Bayesian is used in deriving this conditional distribution, right?

  2. However, I don't actually think the predictive distribution should be just the conditional distribution; I think it should be $$p(Y^*|Y)=\int p(Y^*|h^*)p(h^*|h)p(h|Y)\,dh^*\,dh.$$ In the above formula, $p(h|Y)$ is the posterior, right?

avocado

2 Answers

6

On your first question: GPs are Bayesian because they involve constructing a prior distribution (here over functions directly rather than over parameters) and updating this distribution by conditioning on the data. The Gaussian part just makes the resulting posterior friendlier to work with than it might be otherwise.

On your second question: you might ask how your last equation is realised by the 'even simpler approach' described in section 4.2. Things are indeed being integrated out there.
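
For what it's worth, here is a minimal numerical sketch of that 'friendlier' matrix algebra (the RBF kernel, toy data, and function names below are my own illustration, not something from the question or the linked tutorial): once everything is jointly Gaussian, the predictive mean and covariance of $p(Y^*|Y)$ come out of a couple of linear solves.

    import numpy as np

    def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
        # Squared-exponential kernel between two sets of 1-D inputs.
        d = A[:, None] - B[None, :]
        return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

    def gp_predict(X, Y, X_star, noise_var=0.1):
        # Predictive p(Y*|Y): condition the joint Gaussian on the observed Y.
        K = rbf_kernel(X, X)                        # K(X, X)
        K_s = rbf_kernel(X, X_star)                 # K(X, X*)
        K_ss = rbf_kernel(X_star, X_star)           # K(X*, X*)
        A = K + noise_var * np.eye(len(X))          # K(X,X) + sigma^2 I
        mean = K_s.T @ np.linalg.solve(A, Y)        # K(X*,X) (K + sigma^2 I)^{-1} Y
        cov = (K_ss + noise_var * np.eye(len(X_star))
               - K_s.T @ np.linalg.solve(A, K_s))   # predictive covariance
        return mean, cov

    # Toy example: three noisy observations of sin(x), one test point.
    X = np.array([-1.5, 0.0, 1.0])
    Y = np.sin(X) + 0.1 * np.random.randn(3)
    X_star = np.array([0.5])
    mu, Sigma = gp_predict(X, Y, X_star)
    print(mu, Sigma)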

conjugateprior
  • I'm not sure what exactly you mean by "Gaussian part just makes the resulting posterior friendlier"? Do you mean that if it weren't the Gaussian distribution, the derivation of the posterior would be more complicated? – avocado Feb 03 '14 at 09:21
  • By my second question I mean: to derive $p(Y^*|Y)$, I'm supposed to find $p(h|Y)$ first, which is the posterior of $h$ after observing $Y$. Then, just like parametric Bayesian averaging, I compute the integral $\int p(Y^*|h^*)p(h^*|h)p(h|Y)\,dh^*\,dh$, which is the predictive distribution of $Y^*$; it averages over $p(h|Y)$. However, the tutorial I read doesn't do what I describe; it goes straight to computing the conditional distribution $p(Y^*|Y)$ from the joint $p(Y,Y^*)$, and this is why I think it's not Bayesian. I mean, is computing the Gaussian conditional probability Bayesian? – avocado Feb 03 '14 at 09:32
  • by 'friendlier' I mean 'the marginal and conditional distributions of interest are available in closed form and moreover computable by straightforward matrix algebraic operations' – conjugateprior Feb 03 '14 at 10:32
  • on the second question: Perhaps http://www.gaussianprocess.org/gpml/chapters/RW2.pdf (through section 2.2 at least) is helpful. This explicitly compares the normal parametric Bayesian regression approach to the GP one with no noise (your posterior over h) and then shows how it extends to a posterior over a new data point. – conjugateprior Feb 03 '14 at 10:48
  • On the more general question: 'computing a Gaussian conditional probability' is in itself neither Bayesian nor anything else. The interpretation of a GP as a prior over functions, any finite number of observations from which are jointly Normal, is the Bayesian part. – conjugateprior Feb 03 '14 at 10:50
  • Note that for predictive purposes, the postulation of parameters, the decomposition of y into a mean function plus additive noise, fitted values etc. are all distractions. If you could, you'd want to be able to say: "I think the function I'm predicting values for is *this* smooth, but noisy, I have these observations from it, and I want a distribution for the value of the next one". The GP approach is Bayesian because it translates that request to: "you have a prior over functions from which these data points came and want a posterior for the next one". – conjugateprior Feb 03 '14 at 10:59
  • +1, thanks so much for your kind help. I actually read the GPML book you linked, and it also does the same thing: it derives the predictive distribution by computing the Gaussian conditional. As you said, computing the Gaussian conditional isn't Bayesian at all, and the very thing that makes GP a Bayesian approach is that *it specifies a prior, and the Gaussian conditional acts as the posterior*, right? – avocado Feb 03 '14 at 11:20
  • BTW, as you said in your answer, *"the even simpler approach"* is the Gaussian conditional acting as the posterior, right? And by *things are indeed being integrated out there*, do you mean that the computation of the Gaussian conditional here is indeed the same as the integral I wrote in the post? – avocado Feb 03 '14 at 11:36
0

It seems the question is not totally settled. I was also frustrated about it until I came across the following:

"Posterior probability is just the conditional probability that is outputted by the Bayes theorem. There is nothing special about it, it does not differ anyhow from any other conditional probability, it just has it's own name."

The original answer is here.