First, apply Bayes' rule conditionally: keep conditioning on $X$, $\boldsymbol\theta$, and $\mathcal H_i$ throughout, and only swap $\mathbf w$ and $\mathbf y$:
$$
p(\mathbf w \mid \mathbf y, X, \boldsymbol\theta, \mathcal H_i)
= \frac{p(\mathbf y \mid \mathbf w, X, \boldsymbol\theta, \mathcal H_i) \; p(\mathbf w \mid X, \boldsymbol\theta, \mathcal H_i)}{p(\mathbf y \mid X, \boldsymbol\theta, \mathcal H_i)}
.$$
Now, (5.3) will follow from this if we establish both of
\begin{gather*}
p(\mathbf y \mid \mathbf w, X, \boldsymbol\theta, \mathcal H_i)
= p(\mathbf y \mid \mathbf w, X, \mathcal H_i)
\tag{1}
\\
p(\mathbf w \mid X, \boldsymbol\theta, \mathcal H_i)
= p(\mathbf w \mid \boldsymbol\theta, \mathcal H_i)
\tag{2}
.\end{gather*}
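Substituting (1) and (2) into the numerator of the Bayes' rule above gives
$$
p(\mathbf w \mid \mathbf y, X, \boldsymbol\theta, \mathcal H_i)
= \frac{p(\mathbf y \mid X, \mathbf w, \mathcal H_i) \; p(\mathbf w \mid \boldsymbol\theta, \mathcal H_i)}{p(\mathbf y \mid X, \boldsymbol\theta, \mathcal H_i)}
,$$
which is exactly the form of (5.3).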
But remember what these things are:
- $X$ is the matrix of training inputs
- $\mathbf y$ is the vector of training outputs
- $\mathbf w$ is the vector of parameters, e.g. the weights of a neural network
- $\boldsymbol\theta$ is the vector of hyperparameters, e.g. the regularization weight
- $\mathcal H_i$ indexes which of several discrete model classes we're considering
So $\boldsymbol\theta$ only determines how we choose $\mathbf w$ given the other stuff. Once we're conditioning on $\mathbf w$ itself, also conditioning on $\boldsymbol\theta$ adds no information: $\mathbf y$ is independent of $\boldsymbol\theta$ given $\mathbf w$ (and $X$, $\mathcal H_i$). Thus (1) holds.
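As a concrete (if simplified) illustration, not the book's notation: take $\mathcal H_i$ to be Bayesian linear regression with a fixed noise variance $\sigma^2$, and let the single hyperparameter $\boldsymbol\theta$ be the prior precision, i.e. the regularization weight. Then
$$
p(\mathbf w \mid \boldsymbol\theta, \mathcal H_i) = \mathcal N(\mathbf w \mid \mathbf 0, \boldsymbol\theta^{-1} I),
\qquad
p(\mathbf y \mid \mathbf w, X, \mathcal H_i) = \mathcal N(\mathbf y \mid X \mathbf w, \sigma^2 I)
.$$
The hyperparameter appears only in the prior over $\mathbf w$; once $\mathbf w$ is fixed, the distribution of $\mathbf y$ doesn't mention $\boldsymbol\theta$ at all, which is (1), and the prior over $\mathbf w$ is written down without reference to $X$, which is (2).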
(2) assumes that $\mathbf w$ doesn't depend on $X$ when we don't also have $\mathbf y$: that is, we know which points we're training on, but not their labels. This really is an assumption, and one which I think was more reasonable to make in 2006 when this book was written than it would be today. For instance, if we run (stochastic) gradient descent on a linear model, then the difference between the final weights $\hat{\mathbf w}$ and the initial weights $\mathbf w_0$ lies in the span of the data $X$; when $\mathbf w_0$ is small and the dimension of the data is much higher than the number of samples, this matters a lot. But machine learning people have only really appreciated how important this kind of "implicit bias" is in the past few years, and it's quite tough to model, so assuming that $\mathbf w$ is approximately independent of $X$ alone given $\boldsymbol\theta$ and $\mathcal H_i$ is probably close enough.
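Here's a quick numerical check of that span claim (a minimal sketch; the dimensions, step size, and iteration count are made up for illustration, not from the book):

```python
# Minimal sketch of the "weights stay in the span of the data" point above.
import numpy as np

rng = np.random.default_rng(0)

n, d = 20, 200                      # many more dimensions than samples
X = rng.standard_normal((n, d))     # rows are the training inputs
y = rng.standard_normal(n)

w = np.zeros(d)                     # start at w_0 = 0
lr = 0.01
for _ in range(5_000):
    grad = X.T @ (X @ w - y) / n    # gradient of mean squared error
    w -= lr * grad                  # each step moves w within the row space of X

# Project w onto the span of the data (the row space of X) and measure what's left.
P = X.T @ np.linalg.pinv(X.T)       # orthogonal projector onto span{x_1, ..., x_n}
print(np.linalg.norm(w - P @ w))    # ~1e-15: w never left the span
```

(With $d \gg n$ there are infinitely many weight vectors that fit the data exactly, and gradient descent started from zero converges to the minimum-norm one precisely because it never leaves this span.)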
Also: contrary to what you say in your question, it is not true that $p(\boldsymbol\theta \mid X, \mathbf y, \mathbf w, \mathcal H_i) = 1$ or that $p(X \mid \mathbf w, \mathcal H_i) = 1$.
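To see why for the first one: conditioning the same way as above (and assuming, in the same spirit as (2), that the prior on $\boldsymbol\theta$ doesn't depend on $X$),
$$
p(\boldsymbol\theta \mid X, \mathbf y, \mathbf w, \mathcal H_i)
\propto p(\mathbf w \mid \boldsymbol\theta, \mathcal H_i) \; p(\boldsymbol\theta \mid \mathcal H_i)
$$
as a function of $\boldsymbol\theta$, which is a genuine (and generally non-degenerate) posterior over the hyperparameters, not the constant $1$.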