
We have a multivariate normal vector ${\boldsymbol Y} \sim \mathcal{N}(\boldsymbol\mu, \Sigma)$. Consider partitioning $\boldsymbol\mu$ and ${\boldsymbol Y}$ into $$\boldsymbol\mu = \begin{bmatrix} \boldsymbol\mu_1 \\ \boldsymbol\mu_2 \end{bmatrix} $$ $${\boldsymbol Y}=\begin{bmatrix}{\boldsymbol y}_1 \\ {\boldsymbol y}_2 \end{bmatrix}$$

with a similar partition of $\Sigma$ into $$ \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} $$ Then, $({\boldsymbol y}_1|{\boldsymbol y}_2={\boldsymbol a})$, the conditional distribution of the first partition given the second, is $\mathcal{N}(\overline{\boldsymbol\mu},\overline{\Sigma})$, with mean
$$ \overline{\boldsymbol\mu}=\boldsymbol\mu_1+\Sigma_{12}{\Sigma_{22}}^{-1}({\boldsymbol a}-\boldsymbol\mu_2) $$ and covariance matrix $$ \overline{\Sigma}=\Sigma_{11}-\Sigma_{12}{\Sigma_{22}}^{-1}\Sigma_{21}$$
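As a sanity check (not a derivation), here is a minimal NumPy sketch with made-up numbers confirming that these formulas reduce to the familiar bivariate results $\mu_1+\rho\frac{\sigma_1}{\sigma_2}(a-\mu_2)$ and $\sigma_1^2(1-\rho^2)$; all the values below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical bivariate example (values chosen only for illustration)
mu1, mu2 = 1.0, -2.0
s1, s2, rho = 2.0, 0.5, 0.7
a = 0.3  # observed value of y2

Sigma = np.array([[s1**2,     rho*s1*s2],
                  [rho*s1*s2, s2**2    ]])
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

# General formulas stated above
mu_bar    = mu1 + S12 @ np.linalg.inv(S22) @ (np.array([a]) - mu2)
Sigma_bar = S11 - S12 @ np.linalg.inv(S22) @ S21

# Classical bivariate results they should reduce to
print(np.allclose(mu_bar,    mu1 + rho*(s1/s2)*(a - mu2)))   # True
print(np.allclose(Sigma_bar, s1**2 * (1 - rho**2)))          # True
```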

These results are also stated on Wikipedia, but I have no idea how $\overline{\boldsymbol\mu}$ and $\overline{\Sigma}$ are derived. The results are crucial, since they are the key statistical formulas for deriving the Kalman filter. Would anyone provide the steps for deriving $\overline{\boldsymbol\mu}$ and $\overline{\Sigma}$? Thank you very much!

Flying pig
  • The idea is to use the definition of conditional density $f(y_1\vert y_2=a)=\dfrac{f_{Y_1,Y_2}(y_1,a)}{f_{Y_2}(a)}$. You know that the joint $f_{Y_1,Y_2}$ is multivariate normal and that the marginal $f_{Y_2}$ is normal, so you just have to substitute the values and do the unpleasant algebra. These [notes](http://www.stats.ox.ac.uk/~steffen/teaching/bs2HT9/gauss.pdf) might be of some help. [Here](http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html) is the full proof. –  Jun 16 '12 at 18:16
  • Your second link answers the question (+1). Why not put it as an answer @Procrastinator? – gui11aume Jun 16 '12 at 22:54
  • I hadn't realized it, but I think I was implicitly using this equation in a conditional PCA. The conditional PCA requires a transformation $\left(I-A'\left(AA'\right)^{-1}A\right)\Sigma$ that is effectively calculating the conditional covariance matrix given some choice of A. – John Jul 02 '12 at 15:49
  • @Procrastinator - your approach actually requires the knowledge of the Woodbury matrix identity, and the knowledge of block-wise matrix inversion. These result in unnecessarily complicated matrix algebra. – probabilityislogic Jul 02 '12 at 16:17
  • In fact you can use Macro's simpler answer to prove both of those identities (I'm not sure the Woodbury identity is provable in general this way, but a special case definitely is). – probabilityislogic Jul 02 '12 at 16:21
  • @probabilityislogic Actually the result is proved in the link I provided. But fair enough if you find it more complicated than other methods. In addition, I was not attempting to provide an optimal solution in my *comment*. Also, my comment predates Macro's answer (which I upvoted as you can see). –  Jul 02 '12 at 16:25
  • Macro's answer above is great. As a supplement, we still need the characteristic function to prove that the conditional distribution is normal. See Example 10.20 in these [notes](https://www.ma.utexas.edu/users/gordanz/notes/conditional_expectation.pdf). – zhengli0817 Dec 07 '16 at 17:46
  • @user10525 the conditional probability density: that's exactly what I am asking about! Would you please have a look at this question? https://stats.stackexchange.com/q/458032/5509 – Tomas Apr 02 '20 at 12:37

2 Answers


You can prove it by explicitly calculating the conditional density by brute force, as in Procrastinator's link (+1) in the comments. But, there's also a theorem that says all conditional distributions of a multivariate normal distribution are normal. Therefore, all that's left is to calculate the mean vector and covariance matrix. I remember we derived this in a time series class in college by cleverly defining a third variable and using its properties to derive the result more simply than the brute force solution in the link (as long as you're comfortable with matrix algebra). I'm going from memory but it was something like this:


Let ${\bf x}_{1}$ be the first partition and ${\bf x}_2$ the second. Now define ${\bf z} = {\bf x}_1 + {\bf A} {\bf x}_2 $ where ${\bf A} = -\Sigma_{12} \Sigma^{-1}_{22}$. Now we can write

\begin{align*} {\rm cov}({\bf z}, {\bf x}_2) &= {\rm cov}( {\bf x}_{1}, {\bf x}_2 ) + {\rm cov}({\bf A}{\bf x}_2, {\bf x}_2) \\ &= \Sigma_{12} + {\bf A} {\rm var}({\bf x}_2) \\ &= \Sigma_{12} - \Sigma_{12} \Sigma^{-1}_{22} \Sigma_{22} \\ &= 0 \end{align*}

Therefore ${\bf z}$ and ${\bf x}_2$ are uncorrelated and, since they are jointly normal, they are independent. Now, clearly $E({\bf z}) = {\boldsymbol \mu}_1 + {\bf A} {\boldsymbol \mu}_2$, therefore it follows that

\begin{align*} E({\bf x}_1 | {\bf x}_2) &= E( {\bf z} - {\bf A} {\bf x}_2 | {\bf x}_2) \\ & = E({\bf z}|{\bf x}_2) - E({\bf A}{\bf x}_2|{\bf x}_2) \\ & = E({\bf z}) - {\bf A}{\bf x}_2 \\ & = {\boldsymbol \mu}_1 + {\bf A} ({\boldsymbol \mu}_2 - {\bf x}_2) \\ & = {\boldsymbol \mu}_1 + \Sigma_{12} \Sigma^{-1}_{22} ({\bf x}_2- {\boldsymbol \mu}_2) \end{align*}

which proves the first part. For the covariance matrix, note that

\begin{align*} {\rm var}({\bf x}_1|{\bf x}_2) &= {\rm var}({\bf z} - {\bf A} {\bf x}_2 | {\bf x}_2) \\ &= {\rm var}({\bf z}|{\bf x}_2) + {\rm var}({\bf A} {\bf x}_2 | {\bf x}_2) - {\bf A}\,{\rm cov}({\bf x}_2, {\bf z}|{\bf x}_2) - {\rm cov}({\bf z}, {\bf x}_2|{\bf x}_2) {\bf A}' \\ &= {\rm var}({\bf z}|{\bf x}_2) \\ &= {\rm var}({\bf z}) \end{align*}

Now we're almost done:

\begin{align*} {\rm var}({\bf x}_1|{\bf x}_2) = {\rm var}( {\bf z} ) &= {\rm var}( {\bf x}_1 + {\bf A} {\bf x}_2 ) \\ &= {\rm var}( {\bf x}_1 ) + {\bf A} {\rm var}( {\bf x}_2 ) {\bf A}' + {\rm cov}({\bf x}_1,{\bf x}_2) {\bf A}' + {\bf A} {\rm cov}({\bf x}_2,{\bf x}_1) \\ &= \Sigma_{11} +\Sigma_{12} \Sigma^{-1}_{22} \Sigma_{22}\Sigma^{-1}_{22}\Sigma_{21} - 2 \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \\ &= \Sigma_{11} +\Sigma_{12} \Sigma^{-1}_{22}\Sigma_{21} - 2 \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \\ &= \Sigma_{11} -\Sigma_{12} \Sigma^{-1}_{22}\Sigma_{21} \end{align*}

which proves the second part.
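For readers who like to see the construction in action, here is a minimal numerical sketch (the dimensions, seed, and parameters are arbitrary choices of mine, not part of the derivation): it checks that ${\rm cov}({\bf z},{\bf x}_2)=\Sigma_{12}+{\bf A}\Sigma_{22}$ vanishes exactly, and that simulated data reproduce the conditional mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up partitioned parameters (illustrative only): x1 is 2-dimensional, x2 is 3-dimensional
mu = np.array([1.0, -1.0, 0.5, 2.0, -0.5])
L = rng.normal(size=(5, 5))
Sigma = L @ L.T + 5 * np.eye(5)               # an arbitrary positive-definite covariance
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

A = -S12 @ np.linalg.inv(S22)

# cov(z, x2) = Sigma_12 + A Sigma_22 is exactly zero by construction
print(np.allclose(S12 + A @ S22, 0))          # True

# Monte Carlo check of the conditional moments derived above
x = rng.multivariate_normal(mu, Sigma, size=200_000)
x1, x2 = x[:, :2], x[:, 2:]
pred = mu[:2] + (x2 - mu[2:]) @ np.linalg.inv(S22) @ S21   # rows: mu1 + S12 S22^{-1} (x2 - mu2)
resid = x1 - pred
print(np.round(np.cov(resid.T), 2))           # approximately Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
print(np.round(S11 - S12 @ np.linalg.inv(S22) @ S21, 2))
```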

Note: For those not very familiar with the matrix algebra used here, the Matrix Cookbook is an excellent resource.

Edit: One property used here that is not in the Matrix Cookbook (good catch @FlyingPig) is property 6 on the Wikipedia page about covariance matrices: for two random vectors $\bf x, y$, $${\rm var}({\bf x}+{\bf y}) = {\rm var}({\bf x})+{\rm var}({\bf y}) + {\rm cov}({\bf x},{\bf y}) + {\rm cov}({\bf y},{\bf x})$$ For scalars, of course, ${\rm cov}(X,Y)={\rm cov}(Y,X)$, but for random vectors the two cross-covariance matrices differ: they are transposes of one another.
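A tiny NumPy illustration of this vector identity (the coupling matrix $B$ and the dimensions below are arbitrary choices of mine, used only to make the cross-covariance visibly non-symmetric):

```python
import numpy as np

rng = np.random.default_rng(1)
B = np.array([[1.0, 2.0],
              [0.0, 1.0]])                    # deliberately non-symmetric coupling
w = rng.normal(size=(200_000, 4))
x, y = w[:, :2], w[:, :2] @ B + w[:, 2:]      # two dependent 2-dimensional random vectors

C = np.cov(np.hstack([x, y]).T)               # 4x4 joint sample covariance
var_x, var_y   = C[:2, :2], C[2:, 2:]
cov_xy, cov_yx = C[:2, 2:], C[2:, :2]

print(np.allclose(np.cov((x + y).T), var_x + var_y + cov_xy + cov_yx))  # True
print(np.allclose(cov_xy, cov_yx))            # False: cov(x, y) equals cov(y, x)' only
```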

Macro
  • Thanks for this brilliant method! There is one piece of matrix algebra that does not seem familiar to me: where can I find the formula for expanding $var(x_1+Ax_2)$? I haven't found it in the link you sent. – Flying pig Jun 17 '12 at 06:35
  • @Flyingpig, you're welcome. I believe this is a result of equations $(291),(292)$, combined with an additional property of the variance of the sum of random vectors not written in the Matrix Cookbook - I've added this fact to my answer - thanks for catching that! – Macro Jun 17 '12 at 15:02
  • Wow!! +1 for the patience and the care to write all this. – gui11aume Jun 17 '12 at 15:26
  • This is a very good answer (+1), but could be improved in terms of the ordering of the approach. We start with saying we want a linear combination $z=Cx=C_1x_1+C_2x_2$ of the whole vector that is independent/uncorrelated with $x_2$. This is because we can use the fact that $p(z|x_2)=p(z)$ which means $var(z|x_2)=var(z)$ and $E(z|x_2)=E(z)$. These in turn lead to expressions for $var(C_1x_1|x_2)$ and $E(C_1x_1|x_2)$. This means we should take $C_1=I$. Now we require $cov(z,x_2)=\Sigma_{12}+C_2\Sigma_{22}=0$. If $\Sigma_{22}$ is invertible we then have $C_2=-\Sigma_{12}\Sigma_{22}^{-1}$. – probabilityislogic Jul 02 '12 at 16:00
  • The current ordering is based on the approach of proposing a linear combination and seeing if it works. My suggestion goes more towards finding a criterion we want our linear combination to satisfy, and then solving for that criterion. This way will work better on other problems. – probabilityislogic Jul 02 '12 at 16:05
  • @probabilityislogic, I'd actually never thought about the process that resulted in choosing this linear combination but your comment makes it clear that it arises naturally, considering the constraints we want to satisfy. +1! – Macro Jul 02 '12 at 20:06
  • @Macro: Great proof. There is a small typo: A var(x_1) A' should be A var(x_2) A'. Also, I guess you need to justify why the conditional distribution is indeed *normal*. You showed the form of the mean vector and covariance matrix of this conditional distribution, but not that it is indeed *normal* (and not, say, multivariate t with that mean and covariance matrix). That should easily follow from the fact that linear combinations are normal again, but I guess you should add a comment on that. – Marius Hofert Mar 20 '13 at 09:17
  • @Marius, thank you for the close read and for catching that typo. You're right that I didn't prove that the conditional distributions are indeed normal, rather I explicitly appealed to what I thought was a commonly known theorem. In any case, I took the OP's main question to be _"Would anyone provide me a derivation steps of deriving $\overline{\boldsymbol\mu}$ and $\overline{\Sigma}$ ?"_, which is why I only focused on deriving the mean and covariance. When I have time, I may consider adding the piece you mentioned (or, at least, linking to a textbook page). Cheers! – Macro Mar 20 '13 at 13:46
  • A textbook reference for the proof you gave would be nice, indeed. Cheers. – Marius Hofert Mar 20 '13 at 17:00
  • @Macro What is the reason for defining $z$? How can $z$ and $x_2$ be independent? – Quirik Jul 06 '16 at 10:09
  • @probabilityislogic, How can $p(z|x_2)=p(z)$ leads to $C_1=I$? – jakeoung Jan 13 '18 at 22:47
  • @jakeoung - it is not *proving* that $C_1=I$, it is setting it to this value, so that we get an expression that contains the variables we want to know about. – probabilityislogic Jan 14 '18 at 14:40
  • @jakeoung I also don't quite understand that statement. I understand it in this way: if $cov(z, x_2)=0$, then $cov(C_1^{-1} z, x_2) = C_1^{-1} cov( z, x_2)=0$. So the value of $C_1$ is, in a sense, an arbitrary scaling, and we set $C_1=I$ for simplicity. – Ken T May 05 '18 at 16:03
  • This is a beautiful explanation. Is there any way to reference or reproduce/cite it? – Mathews24 Oct 25 '18 at 22:01
  • Found these [set of notes](http://www.maths.manchester.ac.uk/~mkt/MT3732%20(MVA)/Notes/MVA_Section3.pdf) which contain the above derivation along with associated proofs. – Mathews24 Jan 03 '19 at 04:22

The answer by Macro is great, but here is an even simpler way that does not require you to appeal to any outside theorem about the conditional distribution. It involves writing the Mahalanobis distance in a form that separates the argument variable of the conditioning statement, and then factorising the normal density accordingly.


Rewriting the Mahalanobis distance for a conditional vector: This derivation uses a blockwise matrix inversion formula involving the Schur complement $\boldsymbol{\Sigma}_* \equiv \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$. We first use this formula to write the inverse covariance matrix as:

$$\begin{equation} \begin{aligned} \boldsymbol{\Sigma}^{-1} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \\ \end{bmatrix}^{-1} = \begin{bmatrix} \boldsymbol{\Sigma}_{11}^* & \boldsymbol{\Sigma}_{12}^* \\ \boldsymbol{\Sigma}_{21}^* & \boldsymbol{\Sigma}_{22}^* \\ \end{bmatrix}, \end{aligned} \end{equation}$$

where:

$$\begin{equation} \begin{aligned} \begin{matrix} \boldsymbol{\Sigma}_{11}^* = \boldsymbol{\Sigma}_*^{-1} \text{ } \quad \quad \quad \quad & & & & & \boldsymbol{\Sigma}_{12}^* = -\boldsymbol{\Sigma}_*^{-1} \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}, \quad \quad \quad \\[6pt] \boldsymbol{\Sigma}_{21}^* = - \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_*^{-1} & & & & & \boldsymbol{\Sigma}_{22}^* = \boldsymbol{\Sigma}_{22}^{-1} + \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_*^{-1} \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}. \text{ } \\[6pt] \end{matrix} \end{aligned} \end{equation}$$
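These block formulas are easy to check numerically; here is a short NumPy sketch with an arbitrary positive-definite $\boldsymbol{\Sigma}$ (the 2+3 partition and the random seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary positive-definite Sigma, partitioned into 2 + 3 blocks (toy example)
L = rng.normal(size=(5, 5))
Sigma = L @ L.T + 5 * np.eye(5)
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

S22_inv   = np.linalg.inv(S22)
Schur     = S11 - S12 @ S22_inv @ S21         # Sigma_*
Schur_inv = np.linalg.inv(Schur)

# Assemble the blockwise inverse and compare with a direct inversion
block_inv = np.block([
    [Schur_inv,                  -Schur_inv @ S12 @ S22_inv],
    [-S22_inv @ S21 @ Schur_inv,  S22_inv + S22_inv @ S21 @ Schur_inv @ S12 @ S22_inv],
])
print(np.allclose(block_inv, np.linalg.inv(Sigma)))   # True
```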

Using this formula we can now write the Mahalanobis distance as:

$$\begin{equation} \begin{aligned} (\boldsymbol{y} &- \boldsymbol{\mu})^\text{T} \boldsymbol{\Sigma}^{-1} (\boldsymbol{y} - \boldsymbol{\mu}) \\[6pt] &= \begin{bmatrix} \boldsymbol{y}_1 - \boldsymbol{\mu}_1 \\ \boldsymbol{y}_2 - \boldsymbol{\mu}_2 \end{bmatrix}^\text{T} \begin{bmatrix} \boldsymbol{\Sigma}_{11}^* & \boldsymbol{\Sigma}_{12}^* \\ \boldsymbol{\Sigma}_{21}^* & \boldsymbol{\Sigma}_{22}^* \\ \end{bmatrix} \begin{bmatrix} \boldsymbol{y}_1 - \boldsymbol{\mu}_1 \\ \boldsymbol{y}_2 - \boldsymbol{\mu}_2 \end{bmatrix} \\[6pt] &= \quad (\boldsymbol{y}_1 - \boldsymbol{\mu}_1)^\text{T} \boldsymbol{\Sigma}_{11}^* (\boldsymbol{y}_1 - \boldsymbol{\mu}_1) + (\boldsymbol{y}_1 - \boldsymbol{\mu}_1)^\text{T} \boldsymbol{\Sigma}_{12}^* (\boldsymbol{y}_2 - \boldsymbol{\mu}_2) \\[6pt] &\quad + (\boldsymbol{y}_2 - \boldsymbol{\mu}_2)^\text{T} \boldsymbol{\Sigma}_{21}^* (\boldsymbol{y}_1 - \boldsymbol{\mu}_1) + (\boldsymbol{y}_2 - \boldsymbol{\mu}_2)^\text{T} \boldsymbol{\Sigma}_{22}^* (\boldsymbol{y}_2 - \boldsymbol{\mu}_2) \\[6pt] &= \quad (\boldsymbol{y}_1 - \boldsymbol{\mu}_1)^\text{T} \boldsymbol{\Sigma}_*^{-1} (\boldsymbol{y}_1 - \boldsymbol{\mu}_1) - (\boldsymbol{y}_1 - \boldsymbol{\mu}_1)^\text{T} \boldsymbol{\Sigma}_*^{-1} \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\boldsymbol{y}_2 - \boldsymbol{\mu}_2) \\[6pt] &\quad - (\boldsymbol{y}_2 - \boldsymbol{\mu}_2)^\text{T} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_*^{-1} (\boldsymbol{y}_1 - \boldsymbol{\mu}_1) + (\boldsymbol{y}_2 - \boldsymbol{\mu}_2)^\text{T} \boldsymbol{\Sigma}_{22}^{-1} (\boldsymbol{y}_2 - \boldsymbol{\mu}_2) \\[6pt] &\quad + (\boldsymbol{y}_2 - \boldsymbol{\mu}_2)^\text{T} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_*^{-1} \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\boldsymbol{y}_2 - \boldsymbol{\mu}_2) \\[6pt] &= (\boldsymbol{y}_1 - (\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\boldsymbol{y}_2 - \boldsymbol{\mu}_2)))^\text{T} \boldsymbol{\Sigma}_*^{-1} (\boldsymbol{y}_1 - (\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\boldsymbol{y}_2 - \boldsymbol{\mu}_2))) \\[6pt] &\quad + (\boldsymbol{y}_2 - \boldsymbol{\mu}_2)^\text{T} \boldsymbol{\Sigma}_{22}^{-1} (\boldsymbol{y}_2 - \boldsymbol{\mu}_2) \\[6pt] &= (\boldsymbol{y}_1 - \boldsymbol{\mu}_*)^\text{T} \boldsymbol{\Sigma}_*^{-1} (\boldsymbol{y}_1 - \boldsymbol{\mu}_*) + (\boldsymbol{y}_2 - \boldsymbol{\mu}_2)^\text{T} \boldsymbol{\Sigma}_{22}^{-1} (\boldsymbol{y}_2 - \boldsymbol{\mu}_2) , \\[6pt] \end{aligned} \end{equation}$$

where $\boldsymbol{\mu}_* \equiv \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\boldsymbol{y}_2 - \boldsymbol{\mu}_2)$ is the conditional mean vector. Note that this is a general result that does not assume normality of the random vectors involved in the decomposition. It gives a useful way of decomposing the Mahalanobis distance into a sum of quadratic forms corresponding to the marginal and conditional parts. In the conditional part the conditioning vector $\boldsymbol{y}_2$ is absorbed into the mean vector and variance matrix. To clarify the form, we repeat the equation with labelling of terms:

$$(\boldsymbol{y} - \boldsymbol{\mu})^\text{T} \boldsymbol{\Sigma}^{-1} (\boldsymbol{y} - \boldsymbol{\mu}) = \underbrace{(\boldsymbol{y}_1 - \boldsymbol{\mu}_*)^\text{T} \boldsymbol{\Sigma}_*^{-1} (\boldsymbol{y}_1 - \boldsymbol{\mu}_*)}_\text{Conditional Part} + \underbrace{(\boldsymbol{y}_2 - \boldsymbol{\mu}_2)^\text{T} \boldsymbol{\Sigma}_{22}^{-1} (\boldsymbol{y}_2 - \boldsymbol{\mu}_2)}_\text{Marginal Part}.$$
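This decomposition is also easy to verify numerically; the sketch below (again with toy parameters of my own choosing) evaluates both sides at an arbitrary point:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy parameters: y1 is 2-dimensional, y2 is 3-dimensional
mu = rng.normal(size=5)
L = rng.normal(size=(5, 5))
Sigma = L @ L.T + 5 * np.eye(5)
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

y = rng.normal(size=5)                        # an arbitrary evaluation point
y1, y2, mu1, mu2 = y[:2], y[2:], mu[:2], mu[2:]

S22_inv = np.linalg.inv(S22)
mu_star = mu1 + S12 @ S22_inv @ (y2 - mu2)
S_star  = S11 - S12 @ S22_inv @ S21

full        = (y - mu) @ np.linalg.inv(Sigma) @ (y - mu)
conditional = (y1 - mu_star) @ np.linalg.inv(S_star) @ (y1 - mu_star)
marginal    = (y2 - mu2) @ S22_inv @ (y2 - mu2)
print(np.allclose(full, conditional + marginal))   # True
```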


Deriving the conditional distribution: Now that we have the above form for the Mahalanobis distance, the rest is easy. We have:

$$\begin{equation} \begin{aligned} p(\boldsymbol{y}_1 | \boldsymbol{y}_2, \boldsymbol{\mu}, \boldsymbol{\Sigma}) &\overset{\boldsymbol{y}_1}{\propto} p(\boldsymbol{y}_1 , \boldsymbol{y}_2 | \boldsymbol{\mu}, \boldsymbol{\Sigma}) \\[12pt] &= \text{N}(\boldsymbol{y} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) \\[10pt] &\overset{\boldsymbol{y}_1}{\propto} \exp \Big( - \frac{1}{2} (\boldsymbol{y} - \boldsymbol{\mu})^\text{T} \boldsymbol{\Sigma}^{-1} (\boldsymbol{y} - \boldsymbol{\mu}) \Big) \\[6pt] &\overset{\boldsymbol{y}_1}{\propto} \exp \Big( - \frac{1}{2} (\boldsymbol{y}_1 - \boldsymbol{\mu}_*)^\text{T} \boldsymbol{\Sigma}_*^{-1} (\boldsymbol{y}_1 - \boldsymbol{\mu}_*) \Big) \\[6pt] &\overset{\boldsymbol{y}_1}{\propto}\text{N}(\boldsymbol{y}_1 | \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*). \\[6pt] \end{aligned} \end{equation}$$

This establishes that the conditional distribution is also multivariate normal, with the specified conditional mean vector and conditional variance matrix.
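If desired, the final proportionality argument can also be checked numerically. The SciPy sketch below (toy parameters of my own) compares the ratio of the joint density to the marginal density of $\boldsymbol{y}_2$ with the claimed conditional normal density at a few arbitrary points:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

# Toy parameters, reusing the partition notation above: y1 and y2 are both 2-dimensional
mu = rng.normal(size=4)
L = rng.normal(size=(4, 4))
Sigma = L @ L.T + 4 * np.eye(4)
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

y2 = rng.normal(size=2)                              # the value we condition on
mu_star = mu[:2] + S12 @ np.linalg.inv(S22) @ (y2 - mu[2:])
S_star  = S11 - S12 @ np.linalg.inv(S22) @ S21

joint       = multivariate_normal(mu, Sigma)
marginal_y2 = multivariate_normal(mu[2:], S22)
conditional = multivariate_normal(mu_star, S_star)

for y1 in rng.normal(size=(3, 2)):                   # a few arbitrary y1 points
    lhs = joint.pdf(np.concatenate([y1, y2])) / marginal_y2.pdf(y2)
    print(np.isclose(lhs, conditional.pdf(y1)))      # True each time
```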

Ben
  • Hi Ben. I am sorry for another question. Will the above marginal distribution still hold if we don't assume a normal distribution for $y_1$ and $y_2$? Then what is the conditional distribution of $y_1$ given $y_2$ without the normal distribution? Or, is it possible to calculate the expectation and variance of $y_1$ conditional on $y_2$ following your setup without the normality assumption? I deeply appreciate your help! Thanks – Charles Chou Nov 24 '21 at 16:48
  • @CharlesChou: No, the moments $\boldsymbol{\mu}_*$ and $\boldsymbol{\Sigma}_*$ will not generally hold outside the normal distribution. (Also, note that the above is a conditional distribution, not a marginal distribution.) – Ben Nov 24 '21 at 20:24