
I have heard that partial correlations between random variables can be found by inverting the covariance matrix and taking the appropriate entries of the resulting precision matrix (this fact is mentioned in http://en.wikipedia.org/wiki/Partial_correlation, but without a proof).

Why is this the case?

kjetil b halvorsen
michal
  • If you mean to get partial correlation in a cell controlled for all the other variables, then the last paragraph [here](http://stats.stackexchange.com/a/43224/3277) may shed light. – ttnphns Mar 03 '15 at 08:38

4 Answers


When a multivariate random variable $(X_1,X_2,\ldots,X_n)$ has a nondegenerate covariance matrix $\mathbb{C} = (\gamma_{ij}) = (\text{Cov}(X_i,X_j))$, the set of all real linear combinations of the $X_i$ forms an $n$-dimensional real vector space with basis $E=(X_1,X_2,\ldots, X_n)$ and a non-degenerate inner product given by

$$\langle X_i,X_j \rangle = \gamma_{ij}\ .$$

Its dual basis with respect to this inner product, $E^{*} = (X_1^{*},X_2^{*}, \ldots, X_n^{*})$, is uniquely defined by the relationships

$$\langle X_i^{*}, X_j \rangle = \delta_{ij}\ ,$$

the Kronecker delta (equal to $1$ when $i=j$ and $0$ otherwise).

The dual basis is of interest here because the partial correlation of $X_i$ and $X_j$ is obtained as the correlation between the part of $X_i$ that is left after projecting it onto the space spanned by all the other vectors (let's simply call it its "residual", $X_{i\circ}$) and the comparable part of $X_j$, its residual $X_{j\circ}$. Now $X_i^{*}$ is a vector that is orthogonal to all the basis vectors besides $X_i$ and has positive inner product with $X_i$, whence $X_{i\circ}$ must be some non-negative multiple of $X_i^{*}$, and likewise for $X_j$. Let us therefore write

$$X_{i\circ} = \lambda_i X_i^{*},\ X_{j\circ} = \lambda_j X_j^{*}$$

for positive real numbers $\lambda_i$ and $\lambda_j$.

The partial correlation is the normalized dot product of the residuals, which is unchanged by rescaling:

$$\rho_{ij\circ} = \frac{\langle X_{i\circ}, X_{j\circ} \rangle}{\sqrt{\langle X_{i\circ}, X_{i\circ} \rangle\langle X_{j\circ}, X_{j\circ} \rangle}} = \frac{\lambda_i\lambda_j\langle X_{i}^{*}, X_{j}^{*} \rangle}{\sqrt{\lambda_i^2\langle X_{i}^{*}, X_{i}^{*} \rangle\lambda_j^2\langle X_{j}^{*}, X_{j}^{*} \rangle}} = \frac{\langle X_{i}^{*}, X_{j}^{*} \rangle}{\sqrt{\langle X_{i}^{*}, X_{i}^{*} \rangle\langle X_{j}^{*}, X_{j}^{*} \rangle}}\ .$$

(In either case the partial correlation will be zero whenever the residuals are orthogonal, whether or not they are nonzero.)

We need to find the inner products of dual basis elements. To this end, expand the dual basis elements in terms of the original basis $E$:

$$X_i^{*} = \sum_{j=1}^n \beta_{ij} X_j\ .$$

Then by definition

$$\delta_{ik} = \langle X_i^{*}, X_k \rangle = \sum_{j=1}^n \beta_{ij}\langle X_j, X_k \rangle = \sum_{j=1}^n \beta_{ij}\gamma_{jk}\ .$$

In matrix notation with $\mathbb{I} = (\delta_{ij})$ the identity matrix and $\mathbb{B} = (\beta_{ij})$ the change-of-basis matrix, this states

$$\mathbb{I} = \mathbb{BC}\ .$$

That is, $\mathbb{B} = \mathbb{C}^{-1}$, which is exactly what the Wikipedia article is asserting. The previous formula for the partial correlation gives

$$\rho_{ij\circ} = \frac{\beta_{ij}}{\sqrt{\beta_{ii} \beta_{jj}}} = \frac{\mathbb{C}^{-1}_{ij}}{\sqrt{\mathbb{C}^{-1}_{ii} \mathbb{C}^{-1}_{jj}}}\ .$$
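If a numerical illustration helps, here is a minimal sketch (simulated data and numpy; the variable names and indices are mine), residualizing each variable on *all* of the remaining ones, as in the construction above. The more common convention, which residualizes on the other $n-2$ variables only, reverses the sign; see the comments and the other answers.

```python
# Minimal numerical sketch (simulated data; numpy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))  # four correlated columns
X -= X.mean(axis=0)                                              # mean-center

P = np.linalg.inv(np.cov(X, rowvar=False))                       # precision matrix

def resid(k):
    """Residual of column k after least-squares projection onto all the other columns."""
    Z = np.delete(X, k, axis=1)
    beta, *_ = np.linalg.lstsq(Z, X[:, k], rcond=None)
    return X[:, k] - Z @ beta

i, j = 0, 1
lhs = np.corrcoef(resid(i), resid(j))[0, 1]    # correlation of the two residuals
rhs = P[i, j] / np.sqrt(P[i, i] * P[j, j])     # the formula above
print(lhs, rhs)                                # agree up to floating-point error
```

The agreement uses only linear algebra on the sample covariance matrix; no distributional assumption is involved.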

whuber
  • +1, great answer. But why do you call this dual basis "dual basis with respect to this inner product" -- what does "with respect to this inner product" exactly mean? It seems that you use the term "dual basis" as defined here http://mathworld.wolfram.com/DualVectorSpace.html in the second paragraph ("Given a vector space basis $v_1, ..., v_n$ for $V$ there exists a dual basis...") or here https://en.wikipedia.org/wiki/Dual_basis, and it's independent of any scalar product. – amoeba Nov 11 '15 at 00:57
  • @amoeba There are two kinds of duals. The (natural) dual of any vector space $V$ over a field $R$ is the set of linear functions $\phi:V\to R$, called $V^*$. There is no canonical way to identify $V^*$ with $V$, even though they have the same dimension when $V$ is finite-dimensional. Any inner product $\gamma$ corresponds to such a map $g:V\to V^*$, and *vice versa*, via $$g(v)(w)=\gamma(v,w).$$ (Nondegeneracy of $\gamma$ ensures $g$ is a vector space isomorphism.) This gives a way to view elements of $V$ as if they were elements of the dual $V^*$--but it depends on $\gamma$. – whuber Nov 11 '15 at 01:22
  • I think I understood where I was confused. The thing is that $X_i^* \ne g(X_i)$. Scalar product in $V$ allows to construct mapping $g(\cdot)$, and $X_i^*$ is then defined as $X_i^* = g(W_i) : \langle W_i, X_j \rangle = \delta_{ij}$. In your post you write $X_i^*$ instead of $W_i$, which is fine because of the isomorphism between $V$ and $V^*$. – amoeba Nov 12 '15 at 23:40
  • @whuber: I appreciate the detail in the answer, thank you. I'm a bit confused on expanding the dual basis elements on the original elements. Above, in the section where you "expand the dual basis elements in terms of the original basis E", why do you need the whole sum? Just prior to that, don't you make the argument that $X_i = \lambda_i X_i^*$? At which point I'm missing the point of expanding $X_i^*$ over the whole basis. – mpettis Dec 18 '15 at 16:57
  • (locked out of previous edit, continuing that thought): Shouldn't the expansion reduce to $\beta_{ij} = 1 / \lambda_i \cdot \delta_{ij}$? – mpettis Dec 18 '15 at 17:07
  • Gah, sorry. Ignore my comments. Those equations apply to the partial correlation, not to the original variables. I overlooked the $\cdot$ in the discussion, and so mixed up the original variables with their partials (which @whuber references as "residuals"). – mpettis Dec 18 '15 at 17:18
  • @mpettis Those dots were hard to notice. I have replaced them with small open circles to make the notation easier to read. Thanks for pointing this out. – whuber Dec 18 '15 at 18:22
  • I have a problem with the argument. It says that the partial correlation is the correlation between residuals of projections onto the space of all vectors except the single vector in hand, but Wikipedia says that it is the correlation between residuals of projections onto the space of all vectors except BOTH vectors under consideration. To quote from Wikipedia: two variables $X_i$ and $X_j$ of a set $\mathbf{V}$ of cardinality $n$, given all others, i.e., $\mathbf{V} \setminus \{X_{i},X_{j}\}$, if the correlation matrix (or alternatively covariance matrix) $\Omega = (\omega_{ij})$, where $\omega_{ij} = \rho_{X_iX_j}$, is positive definite and therefore i – Dec 18 '15 at 15:58
  • @KrzysztofPodgorski The two definitions are mathematically equivalent. – whuber Dec 18 '15 at 18:25
  • @whuber Great answer, thanks! As an aside though, could you recommend me a book that explores this kind of intersection between (rather geometric) linear algebra and probability? I've run into similar ideas several times now and would like to put them in context. – Andy Jones Dec 26 '15 at 14:59
  • @Andy Ron Christensen's [*Plane Answers to Complex Questions*](http://link.springer.com/book/10.1007%2F978-1-4419-9816-3) might be the sort of thing you are looking for. Unfortunately, his approach makes (IMHO) undue reliance on coordinate arguments and calculations. In the original introduction (see p. xiii), Christensen explains that's for pedagogical reasons. – whuber Dec 26 '15 at 15:06
  • @whuber, Your proof is awesome. I wonder whether any book or article contains such a proof so that I can cite. – Harry Jan 01 '16 at 03:42
  • @whuber, the partial correlation formula has a negative sign from Wiki. How can we get the negative sign from your answer? thanks. – ahala Aug 26 '16 at 15:29
  • @whuber, you must change "dot" to "small open circle" in the _last_ line as well: rhodot to rhosmallopencircle. – Erdogan CEVHER Dec 09 '17 at 22:46
  • Where in the proof do we show that $\langle X_i^{*}, X_j^{*} \rangle = \beta_{ij}$? My apology if I misunderstand any part. – Heisenberg Feb 22 '18 at 07:50
  • This answer is not technically correct and needs to be removed. It tries to show a relationship between partial correlations and entries of the precision matrix, but it uses an incorrect definition of the former (residualizing on all n-1 variables, not n-2). Consequently, it's not surprising that it ends up with a sign error (as the comments have noted). An authoritative reference (Lauritzen, p. 130) is noted in the other answer. – user357269 Aug 03 '20 at 22:24

Here is a proof with just matrix calculations.

I appreciate whuber's answer. It is very insightful about the math behind the scenes. However, it is still not trivial to see how to use his answer to obtain the minus sign in the formula stated in the Wikipedia article (Partial_correlation#Using_matrix_inversion): $$ \rho_{X_iX_j\cdot \mathbf{V} \setminus \{X_i,X_j\}} = - \frac{p_{ij}}{\sqrt{p_{ii}p_{jj}}}. $$

To get this minus sign, here is a different proof, which I found in "Graphical Models" (Lauritzen 1995, page 130). It requires only some matrix calculations.

The key is the following matrix identity: $$ \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} E^{-1} & -E^{-1}G \\ -FE^{-1} & D^{-1}+FE^{-1}G \end{pmatrix} $$ where $E = A - BD^{-1}C$, $F = D^{-1}C$ and $G = BD^{-1}$.
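As a quick sanity check (not part of the proof), the identity can be verified numerically on a random partitioned matrix; the block sizes below are arbitrary.

```python
# Numerical sanity check of the block-inverse identity (arbitrary block sizes).
import numpy as np

rng = np.random.default_rng(0)
p, q = 2, 3
M = rng.standard_normal((p + q, p + q)) + (p + q) * np.eye(p + q)  # comfortably invertible
A, B = M[:p, :p], M[:p, p:]
C, D = M[p:, :p], M[p:, p:]

E = A - B @ np.linalg.inv(D) @ C          # Schur complement of D
F = np.linalg.inv(D) @ C
G = B @ np.linalg.inv(D)

Einv, Dinv = np.linalg.inv(E), np.linalg.inv(D)
reconstructed = np.block([[Einv, -Einv @ G],
                          [-F @ Einv, Dinv + F @ Einv @ G]])
print(np.allclose(np.linalg.inv(M), reconstructed))  # expected: True
```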

Write down the covariance matrix as $$ \Omega = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix} $$ where $\Omega_{11}$ is covariance matrix of $(X_i, X_j)$ and $\Omega_{22}$ is covariance matrix of $\mathbf{V} \setminus \{X_i, X_j \}$.

Let $P = \Omega^{-1}$. Similarly, write down $P$ as $$ P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix} $$

By the key matrix identity, $$ P_{11}^{-1} = \Omega_{11} - \Omega_{12}\Omega_{22}^{-1}\Omega_{21} $$

We also know that $\Omega_{11} - \Omega_{12}\Omega_{22}^{-1}\Omega_{21}$ is the covariance matrix of $(X_i, X_j) \mid \mathbf{V} \setminus \{X_i, X_j\}$ when the vector is multivariate normal (see the Wikipedia section Multivariate_normal_distribution#Conditional_distributions); more generally, it is the covariance matrix of the residuals of $(X_i, X_j)$ after linear regression on $\mathbf{V} \setminus \{X_i, X_j\}$. The partial correlation is therefore $$ \rho_{X_iX_j\cdot \mathbf{V} \setminus \{X_i,X_j\}} = \frac{[P_{11}^{-1}]_{12}}{\sqrt{[P_{11}^{-1}]_{11}[P_{11}^{-1}]_{22}}}, $$ where the $(k,l)$th entry of a matrix $M$ is denoted by $[M]_{kl}$.

By the simple inversion formula for a 2-by-2 matrix, $$ \begin{pmatrix} [P_{11}^{-1}]_{11} & [P_{11}^{-1}]_{12} \\ [P_{11}^{-1}]_{21} & [P_{11}^{-1}]_{22} \\ \end{pmatrix} = P_{11}^{-1} = \frac{1}{\text{det} P_{11}} \begin{pmatrix} [P_{11}]_{22} & -[P_{11}]_{12} \\ -[P_{11}]_{21} & [P_{11}]_{11} \\ \end{pmatrix} $$

Therefore, $$ \rho_{X_iX_j\cdot \mathbf{V} \setminus \{X_i,X_j\}} = \frac{[P_{11}^{-1}]_{12}}{\sqrt{[P_{11}^{-1}]_{11}[P_{11}^{-1}]_{22}}} = \frac{- \frac{1}{\text{det}P_{11}}[P_{11}]_{12}}{\sqrt{\frac{1}{\text{det}P_{11}}[P_{11}]_{22}\frac{1}{\text{det}P_{11}}[P_{11}]_{11}}} = \frac{-[P_{11}]_{12}}{\sqrt{[P_{11}]_{22}[P_{11}]_{11}}} $$ which is exactly what the Wikipedia article is asserting.
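Here is a small numerical sketch of the two steps above, using an arbitrary positive-definite matrix built with numpy (the indices are illustrative): the Schur complement equals $P_{11}^{-1}$, and the partial correlation computed from it equals $-p_{ij}/\sqrt{p_{ii}p_{jj}}$.

```python
# Numerical sketch: the Schur complement equals inv(P_11), and the partial
# correlation built from it equals -P_ij / sqrt(P_ii * P_jj).
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Omega = A @ A.T + 5 * np.eye(5)                  # an arbitrary positive-definite covariance
P = np.linalg.inv(Omega)                         # precision matrix

i, j = 0, 1
rest = [k for k in range(5) if k not in (i, j)]
O11 = Omega[np.ix_([i, j], [i, j])]
O12 = Omega[np.ix_([i, j], rest)]
O21 = Omega[np.ix_(rest, [i, j])]
O22 = Omega[np.ix_(rest, rest)]
P11 = P[np.ix_([i, j], [i, j])]

S = O11 - O12 @ np.linalg.inv(O22) @ O21         # Schur complement
print(np.allclose(S, np.linalg.inv(P11)))        # expected: True

partial = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])
print(np.isclose(partial, -P[i, j] / np.sqrt(P[i, i] * P[j, j])))   # expected: True
```

Only positive definiteness of $\Omega$ is used for the matrix identities themselves; normality matters only for interpreting the Schur complement as a conditional covariance.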

Po C.
  • If we let `i=j`, then `rho_ii V\{X_i, X_i} = -1`. How do we interpret those diagonal elements in the precision matrix? – Jason May 23 '18 at 03:19
  • Good point. The formula should be only valid for i=/=j. From the proof, the minus sign comes from the 2-by-2 matrix inversion. It would not happen if i=j. – Po C. May 23 '18 at 03:42
  • So the diagonal numbers can't be associated with partial correlation. What do they represent? They are not just inverses of the variances, are they? – Jason May 23 '18 at 04:49
  • This formula is valid for i=/=j. It is meaningless for i=j. – Po C. May 23 '18 at 08:21
  • But the diagonal elements do not show a minus sign, see `[P_11^-1]_11`! Thus, I would say, the formula is always correct and the diagonal elements are `1`. – Christoph May 27 '20 at 14:38
  • Is this proof true only for the multi-variate normal case? – Maverick Meerkat Feb 27 '21 at 16:51

Note that the sign of the answer actually depends on how you define partial correlation. There is a difference between regressing $X_i$ and $X_j$ on the other $n - 1$ variables separately and regressing $X_i$ and $X_j$ on the other $n - 2$ variables together. Under the second definition, let $\rho$ be the correlation between the residuals $\epsilon_i$ and $\epsilon_j$. Further residualizing $\epsilon_i$ on $\epsilon_j$ and vice versa (which, by the Frisch-Waugh argument, is exactly what the first definition produces) yields residuals whose correlation is $-\rho$.

This explains the confusion in the comments above, as well as on Wikipedia. The second definition is used universally from what I can tell, so there should be a negative sign.
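A quick way to see this sign flip is to compute both kinds of residuals on simulated data; the sketch below (numpy; indices arbitrary) prints two correlations that are negatives of each other.

```python
# Quick check of the sign flip between the two conventions (simulated data; numpy only).
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))
X -= X.mean(axis=0)

def resid(k, cols):
    """Residual of column k after least-squares projection onto the columns in `cols`."""
    Z = X[:, cols]
    beta, *_ = np.linalg.lstsq(Z, X[:, k], rcond=None)
    return X[:, k] - Z @ beta

i, j, rest = 0, 1, [2, 3]

r1_i = resid(i, [j] + rest)    # first definition: regress on the other n-1 variables
r1_j = resid(j, [i] + rest)
r2_i = resid(i, rest)          # second definition: regress on the other n-2 variables
r2_j = resid(j, rest)

print(np.corrcoef(r1_i, r1_j)[0, 1])   # these two printed values are
print(np.corrcoef(r2_i, r2_j)[0, 1])   # negatives of each other
```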

I originally posted an edit to the other answer, but made a mistake - sorry about that!

Johnny Ho

For another perspective, this answer examines the left inverse of a finite data matrix $A$. Here we treat the data as a sample rather than a theoretical distribution: any distribution -- even a continuous one -- has a covariance matrix, but you cannot generally speak of a data matrix unless you allow infinite vectors and/or special inner products.

So we have a finite sample in an $n$-by-$m$ data matrix $A$. Let each column be one random variable, so there are $n$ observations of $m$ random variables. Let $A$'s columns (the random variables) be linearly independent (independence in the linear-algebra sense, not in the sense of independent random variables).

Let $A$ be mean-centered already. Then,

$$ C = \frac{1}{n}A^TA $$

is our covariance matrix. It's invertible since $A$'s columns are linearly independent.

And we'll use later that $C^{-1} = n(A^TA)^{-1}$

The left inverse of $A$ is

$B = (A^TA)^{-1}A^T$.

And we have

$BA = I_{m \times m}$.

What do we know about $B$?

  1. It's m-by-n. There's a row of $B$ corresponding to each column of $A$.
  2. Because $BA = I$, we know the inner product of the $i$th row of $B$ with the $i$th column in $A$ equals 1 (diagonal of $I$).
  3. An inner product of the $i$th row of $B$ with a $j$th ($i \neq j$) column of $A$ is 0 (off-diagonal of $I$).
  4. The right-most term in the expression for $B$ is $A^T$. Therefore $B$'s rows are in the rowspace of $A^T$, the column space of $A$.
  5. By (4) and the fact that $A$'s columns are mean-centered, $B$'s rows must also be mean-centered.

Let $x_i$ be the $i$th column of $A$.

The only vectors that have a non-zero inner product with $x_i$, zero inner product with all the other $x_j$, and are linear combinations of the columns of $A$ are the vectors parallel to the residual of $x_i$ after projecting it onto the space spanned by all the other $x_j$.

Call these residuals $r_{i}$. And call the projection (the linear regression result) $p_i$. So the $i$th row of $B$ must be parallel to $r_i$ (6).

Now we know its direction, but what about magnitude? Let $b_i$ be the $i$th row of $B$.

$$ \begin{align} 1 & = b_i \cdot x_i &&\text{by (2)} \\ & = b_i \cdot (p_i + r_i) &&\text{$x_i$ is the sum of its projection and residual}\\ & = (b_i \cdot p_i) + (b_i \cdot r_i) &&\text{linearity of dot product} \\ & = 0 + (b_i \cdot r_i) &&\text{by (3), and that $p_i$ is a linear combination of the $x_j$s ($j \neq i$)} \\ & = (c_i r_i) \cdot r_i &&\text{for some constant $c_i$, by (6)} \\ \end{align} $$

Therefore, $c_i = \dfrac{1}{r_i \cdot r_i} = \dfrac{1}{\|r_i\|^2}$, so $b_i = \dfrac{r_i}{\|r_i\|^2}$.

We now know what each row of $B$ looks like. Notice

$BB^T = ((A^TA)^{-1}A^T)(A((A^TA)^{-1})^T) = (A^TA)^{-1} = \frac{1}{n}C^{-1}$

We can look at any $i,j$th element

$C^{-1}_{ij} = n(BB^T)_{ij} = n (b_i \cdot b_j) = n\dfrac{r_i \cdot r_j}{\|r_i\|^2\|r_j\|^2}$

The $(r_i \cdot r_j)$ part of that should tell you we're getting close to covariances and correlations of these residuals. Conveniently, the diagonal elements look like

$C^{-1}_{ii} = n\dfrac{r_i \cdot r_i}{\|r_i\|^2\|r_i\|^2} = n\dfrac{1}{\|r_i\|^2}$.

This quantity is exactly 1 over the variance of the residual $r_i$, $\dfrac{\|r_i\|^2}{n}$ (the $n$ makes it a variance instead of a squared vector magnitude).

Then to get partial correlations you just need to combine the elements of $C^{-1}$ in the way others have shown.
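If it helps to see the algebra above in numbers, here is a short sketch (simulated data, numpy; the names are mine) checking that $BA = I$, that the $i$th row of $B$ equals $r_i/\|r_i\|^2$, and that $C^{-1}_{ii} = n/\|r_i\|^2$.

```python
# Short numerical check of the claims above (simulated, mean-centered data).
import numpy as np

rng = np.random.default_rng(3)
n, m = 500, 4
A = rng.standard_normal((n, m)) @ rng.standard_normal((m, m))
A -= A.mean(axis=0)                              # mean-center, as assumed above

B = np.linalg.inv(A.T @ A) @ A.T                 # left inverse
C_inv = n * np.linalg.inv(A.T @ A)               # inverse of C = (1/n) A^T A

i = 0
others = [k for k in range(m) if k != i]
beta, *_ = np.linalg.lstsq(A[:, others], A[:, i], rcond=None)
r_i = A[:, i] - A[:, others] @ beta              # residual of column i on all the others

print(np.allclose(B @ A, np.eye(m)))             # B is a left inverse of A
print(np.allclose(B[i], r_i / (r_i @ r_i)))      # i-th row of B is r_i / ||r_i||^2
print(np.isclose(C_inv[i, i], n / (r_i @ r_i)))  # C^{-1}_{ii} = n / ||r_i||^2
```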

MathFoliage