I came across this problem while reading the book "Machine Learning: A Probabilistic Perspective" by Kevin Murphy. It appears in Section 7.6.1 of the book.

Assume the likelihood is given by

$$ \begin{split} p(\mathbf{y}|\mathbf{X},\mathbf{w},\mu,\sigma^2) & = \mathcal{N}(\mathbf{y}|\mu\mathbf{1}_N+\mathbf{X}\mathbf{w}, \sigma^2\mathbf{I}_N) \\ & \propto \exp\Big(-\frac{1}{2\sigma^2}(\mathbf{y}-\mu\mathbf{1}_N - \mathbf{X}\mathbf{w})^T(\mathbf{y}-\mu\mathbf{1}_N - \mathbf{X}\mathbf{w})\Big) \end{split} \tag{7.53} $$

$\mu$ and $\sigma^2$ are scalars, with $\mu$ serving as an offset. $\mathbf{1}_N$ is a column vector of ones of length $N$.

We put an improper prior on $\mu$ of the form $p(\mu) \propto 1$ and then integrate it out to get

$$ p(\mathbf{y}|\mathbf{X},\mathbf{w},\sigma^2) \propto \exp\Big(-\frac{1}{2\sigma^2}||\mathbf{y}-\bar{y}\mathbf{1}_N - \mathbf{X}\mathbf{w}||_2^2\Big) \tag{7.54} $$

where $\bar{y}=\frac{1}{N}\sum_{i=1}^{N}y_i$ is the empirical mean of the output.

I tried expanding the quadratic form (the last line of $(7.53)$) and integrating over $\mu$ directly, but failed.

Any idea or hint on how to derive $(7.54)$ from $(7.53)$?

zwcikyf

1 Answer


This calculation assumes that the columns of the design matrix have been centred, so that:

$$(\mathbf{Xw}) \cdot \mathbf{1}_N = \mathbf{w}^\text{T} \mathbf{X}^\text{T} \mathbf{1}_N = \mathbf{w}^\text{T} \mathbf{0} = 0.$$
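As a quick numerical sanity check (not part of the original answer; all data and dimensions below are arbitrary), centring the columns of $\mathbf{X}$ does indeed make this dot product vanish:

```python
# Sketch: centring the columns of X forces (Xw) . 1_N = 0.
import numpy as np

rng = np.random.default_rng(0)
N, D = 10, 3
X = rng.normal(size=(N, D))
Xc = X - X.mean(axis=0)      # subtract each column's mean
w = rng.normal(size=D)
ones = np.ones(N)

print(np.dot(Xc @ w, ones))  # ~0, up to floating-point error
```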

With this restriction you can rewrite the quadratic form as a quadratic in $\mu$ plus a term that does not depend on $\mu$ as follows:

$$\begin{equation} \begin{aligned} || \mathbf{y} - \mu \mathbf{1}_N - \mathbf{X} \mathbf{w} ||^2 &= || \mathbf{y} - \bar{y} \mathbf{1}_N - \mathbf{X} \mathbf{w} + (\bar{y} - \mu) \mathbf{1}_N ||^2 \\[6pt] &= || \mathbf{y} - \bar{y} \mathbf{1}_N - \mathbf{X} \mathbf{w} ||^2 + 2 (\bar{y} - \mu) (\mathbf{y} - \bar{y} \mathbf{1}_N - \mathbf{X} \mathbf{w}) \cdot \mathbf{1}_N + (\bar{y} - \mu)^2 || \mathbf{1}_N ||^2 \\[6pt] &= || \mathbf{y} - \bar{y} \mathbf{1}_N - \mathbf{X} \mathbf{w} ||^2 + 0 + N (\bar{y} - \mu)^2 \\[6pt] &= || \mathbf{y} - \bar{y} \mathbf{1}_N - \mathbf{X} \mathbf{w} ||^2 + N (\mu - \bar{y})^2, \\[6pt] \end{aligned} \end{equation}$$

where the cross term vanishes because $(\mathbf{y} - \bar{y} \mathbf{1}_N) \cdot \mathbf{1}_N = N \bar{y} - N \bar{y} = 0$ and $(\mathbf{Xw}) \cdot \mathbf{1}_N = 0$ by the centring assumption.
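This identity is easy to verify numerically. A minimal sketch, again with arbitrary random data and a centred design matrix:

```python
# Sketch: check ||y - mu*1 - Xw||^2 == ||y - ybar*1 - Xw||^2 + N*(mu - ybar)^2
# when the columns of X are centred (X^T 1_N = 0).
import numpy as np

rng = np.random.default_rng(1)
N, D = 10, 3
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)          # centre columns
w = rng.normal(size=D)
y = rng.normal(size=N)
mu, ybar = 1.7, y.mean()     # mu is an arbitrary offset
ones = np.ones(N)

lhs = np.sum((y - mu * ones - X @ w) ** 2)
rhs = np.sum((y - ybar * ones - X @ w) ** 2) + N * (mu - ybar) ** 2
print(np.isclose(lhs, rhs))  # True
```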

Hence, with the improper prior $\pi(\mu) \propto 1$ you have:

$$\begin{equation} \begin{aligned} p(\mathbf{y}|\mathbf{X},\mathbf{w},\sigma^2) &= \int \limits_\mathbb{R} p(\mathbf{y}|\mathbf{X},\mathbf{w},\mu,\sigma^2) \pi(\mu) \ d \mu \\[6pt] &\overset{\mathbf{y}}{\propto} \int \limits_\mathbb{R} \exp \Big( -\frac{1}{2\sigma^2} || \mathbf{y}-\mu\mathbf{1}_N - \mathbf{X}\mathbf{w} ||^2 \Big) \ d \mu \\[6pt] &= \exp \Big( -\frac{1}{2\sigma^2} || \mathbf{y} - \bar{y} \mathbf{1}_N - \mathbf{X} \mathbf{w} ||^2 \Big) \int \limits_\mathbb{R} \exp \Big( -\frac{N}{2\sigma^2} (\mu - \bar{y})^2 \Big) \ d \mu \\[6pt] &\overset{\mathbf{y}}{\propto} \exp \Big( -\frac{1}{2\sigma^2} || \mathbf{y} - \bar{y} \mathbf{1}_N - \mathbf{X} \mathbf{w} ||^2 \Big) \int \limits_\mathbb{R} \text{N} \Big( \mu \Big| \bar{y}, \frac{\sigma^2}{N} \Big) \ d \mu \\[6pt] &= \exp \Big( -\frac{1}{2\sigma^2} || \mathbf{y} - \bar{y} \mathbf{1}_N - \mathbf{X} \mathbf{w} ||^2 \Big). \\[6pt] \end{aligned} \end{equation}$$
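The leftover integral over $\mu$ equals the constant $\sqrt{2\pi\sigma^2/N}$, which is free of $\mathbf{y}$ and can therefore be absorbed into the proportionality. A quick numerical check of that constant (values of $N$, $\sigma$, $\bar{y}$ below are arbitrary):

```python
# Sketch: int exp(-N/(2*sigma^2) * (mu - ybar)^2) dmu == sqrt(2*pi*sigma^2/N).
import numpy as np
from scipy.integrate import quad

N, sigma, ybar = 10, 0.8, 2.3
val, _ = quad(lambda mu: np.exp(-N / (2 * sigma**2) * (mu - ybar)**2),
              -np.inf, np.inf)
print(np.isclose(val, np.sqrt(2 * np.pi * sigma**2 / N)))  # True
```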

Thus, the integrated likelihood has the Gaussian kernel form:

$$\mathbf{y}|\mathbf{X},\mathbf{w},\sigma^2 \sim \text{N}(\bar{y} \mathbf{1}_N + \mathbf{X} \mathbf{w}, \sigma^2 \mathbf{I}_N),$$

which is exactly equation $(7.54)$.
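As an end-to-end check of the whole derivation, one can numerically integrate the un-normalised likelihood in $(7.53)$ over $\mu$ and compare it against the closed form implied by $(7.54)$; a sketch under the same centring assumption, with arbitrary data:

```python
# Sketch: integrating exp(-||y - mu*1 - Xw||^2 / (2*sigma^2)) over mu should
# give sqrt(2*pi*sigma^2/N) * exp(-||y - ybar*1 - Xw||^2 / (2*sigma^2)).
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(2)
N, D, sigma = 8, 2, 0.9
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)          # centre columns so X^T 1_N = 0
w = rng.normal(size=D)
y = rng.normal(size=N)
ybar, ones = y.mean(), np.ones(N)

def unnorm_lik(mu):
    r = y - mu * ones - X @ w
    return np.exp(-r @ r / (2 * sigma**2))

integral, _ = quad(unnorm_lik, -np.inf, np.inf)
r0 = y - ybar * ones - X @ w
closed = np.sqrt(2 * np.pi * sigma**2 / N) * np.exp(-r0 @ r0 / (2 * sigma**2))
print(np.isclose(integral, closed))  # True
```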

Ben
  • Just for consistency of matrix notation: I think the first assumption should be $(\mathbf{X}\mathbf{w})^T\mathbf{1}_N$, and the middle term in the second line of the second equation should be $2(\bar{y}-\mu)(\mathbf{y}-\bar{y} \mathbf{1}_N-\mathbf{X}\mathbf{w})^T\mathbf{1}_N$. – zwcikyf Feb 15 '19 at 04:40
  • @zwcikyf: The dot product does exactly that (i.e., $(\mathbf{Xw}) \cdot \mathbf{1}_N = (\mathbf{Xw})^\text{T} \mathbf{1}_N$). That is standard linear algebra notation. – Ben Feb 15 '19 at 04:53