I've been reading over this Multivariate Gaussian conditional proof, trying to make sense of how the mean and variance of a Gaussian conditional were derived. I've come to accept that unless I allocate a dozen or so hours to refreshing my linear algebra, it's out of my reach for the time being.
That being said, I'm looking for a conceptual explanation of what these equations represent:
$$\mu_{1|2} = \mu_1 + \Sigma_{1,2}\Sigma_{2,2}^{-1}(x_2 - \mu_2)$$
I read the first as: "Take $\mu_1$ and augment it by some factor, which is the covariance scaled by the precision (a measure of how closely $X_2$ is clustered about $\mu_2$, maybe?) and projected onto the distance of the specific $x_2$ from $\mu_2$."
$$\Sigma_{1|2} = \Sigma_{1,1} - \Sigma_{1,2}\Sigma_{2,2}^{-1}\Sigma_{2,1}$$
I read the second as: "Take the variance about $\mu_1$ and subtract some factor, which (in the scalar case) is the covariance squared, scaled by the precision about $x_2$."
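To make this concrete for myself, I wrote a small NumPy sketch (the numbers are made up) that plugs a toy 2-D covariance into both formulas and then sanity-checks them by keeping only the sampled points whose $x_2$ lands near the observed value:

```python
import numpy as np

# Toy 2-D Gaussian with made-up parameters: x = (x_1, x_2).
mu = np.array([1.0, 2.0])            # mu_1 = 1.0, mu_2 = 2.0
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])       # Sigma_{1,1}=2.0, Sigma_{1,2}=0.8, Sigma_{2,2}=1.0

x2_obs = 3.5                         # the observed value of x_2

# Conditional mean: mu_1 + Sigma_{1,2} Sigma_{2,2}^{-1} (x_2 - mu_2)
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_obs - mu[1])

# Conditional variance: Sigma_{1,1} - Sigma_{1,2} Sigma_{2,2}^{-1} Sigma_{2,1}
var_cond = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]
print(mu_cond, var_cond)             # 2.2 and 1.36

# Sanity check by sampling: among draws whose x_2 is close to 3.5,
# the mean and variance of x_1 should land near the values above.
rng = np.random.default_rng(0)
draws = rng.multivariate_normal(mu, Sigma, size=500_000)
near = draws[np.abs(draws[:, 1] - x2_obs) < 0.05, 0]
print(near.mean(), near.var())
```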
In either case, the precision $\Sigma^{-1}_{2,2}$ seems to be playing a really important role.
A few questions:
- Am I right to treat precision as a measure of how closely observations are clustered about the expectation?
- Why does the covariance appear squared in the latter equation? (Is there a geometric interpretation?) So far, I've been treating $\Sigma_{1,2}\Sigma_{2,2}^{-1}$ as a ratio, $a/b$, so this ratio acts to scale the second covariance factor, essentially accounting for/damping the effect of the covariance; I don't know if this is valid (I try to check it numerically in the sketch after this list).
- Anything else you'd like to add/clarify?
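Regarding the ratio reading in the second bullet, here's the quick check I did (again with made-up numbers): as far as I can tell, $\Sigma_{1,2}\Sigma_{2,2}^{-1}$ comes out as the least-squares slope for predicting $x_1$ from $x_2$, and the subtracted term in the variance formula is the variance that regression explains:

```python
import numpy as np

# Same toy parameters as above.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

rng = np.random.default_rng(1)
draws = rng.multivariate_normal(mu, Sigma, size=200_000)
x1, x2 = draws[:, 0], draws[:, 1]

C = np.cov(x1, x2)                    # 2x2 empirical covariance matrix
slope = C[0, 1] / C[1, 1]             # empirical Sigma_{1,2} / Sigma_{2,2}
print(slope)                          # close to 0.8 / 1.0 = 0.8

# Residual variance after regressing out x_2: matches
# Sigma_{1,1} - Sigma_{1,2} Sigma_{2,2}^{-1} Sigma_{2,1} = 2.0 - 0.64 = 1.36
print(np.var(x1 - slope * x2))
```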