
I have been trying hard for a month to get a clear intuition on the relation of the covariance matrix to the length of projections (maximum or otherwise). I understand conceptually that if I have a data matrix $A$ of size $[N,2]$ and I want to project the data onto some vector $x$ of size $[2,1]$, then $Ax$ gives me a list of projections, one for each data point in $A$.

Now we say we want to maximise the variance of these projections, which is equivalent to maximising the length of $Ax$, so we want to $\max((Ax)^T(Ax))$ with respect to $x$. I understand things clearly up to this point.

Rewriting this as $\max(x^T(A^TA)x)$ is where things start to trouble me a bit.

Say $e$ is the vector that maximises this. I would like to interpret what happens to $e$ step by step here. First, $Ae$ projects $e$ into the space of $A$. Then $A^T(Ae)$ gets the projection lengths of this vector onto the basis vectors in $A$. And finally $e^T(A^TAe)$ takes a linear combination of these lengths to give the final output. I don't see clearly how the last two steps should be interpreted as giving a length. I understand the whole thing conceptually, but not how it works in the tiny details as above.

I would also like to know where covariance plays a role in the above way of looking at things. (I understand $A^TA$ is the covariance matrix, but where does it appear in the length calculations?)

Rahul Deora
  • The $e$ vector is usually a unit vector in order to scale appropriately. To recover the scaling in A this is then multiplied by a scalar (in eigen decomposition this will be the eigenvalue). A key benefit is that a scalar can be easily inverted for the inverse operation to multiplication. – ReneBt Jan 08 '20 at 06:27
  • $A^\prime A$ is a covariance matrix only when its columns have means of zero. That would seem to contradict the information you see clearly, so you might want to revisit your assumptions. – whuber Jan 08 '20 at 07:14
  • @whuber Right, let's say it is normalised; I am just looking for a step-by-step interpretation of $x^T(A^TA)x$. What would it be then? – Rahul Deora Jan 10 '20 at 06:16
  • Because $$x^\prime (A^\prime A) x = (Ax)^\prime\, (Ax) = |Ax|^2,$$ a good interpretation is that it's the squared Euclidean length of $Ax.$ – whuber Jan 10 '20 at 14:11

1 Answer


So you're okay with $\max((Ax)^T(Ax))$ but not $\max(x^T(A^TA)x)$.

You may already know that we can get equivalence between these by just applying a few basic rules of matrix algebra -- transpose of product, and associativity:

$$(Ax)^T(Ax) = (x^TA^T)(Ax) = x^TA^TAx = x^T(A^TA)x.$$
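If it helps, here's a quick numerical check of that identity on made-up data (the sizes and values are arbitrary, purely for illustration):

```python
import numpy as np

# Toy check of (Ax)^T (Ax) == x^T (A^T A) x on random data.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))    # an n-by-3 data matrix
x = rng.normal(size=3)

lhs = (A @ x) @ (A @ x)          # (Ax)^T (Ax)
rhs = x @ (A.T @ A) @ x          # x^T (A^T A) x
print(np.isclose(lhs, rhs))      # True
```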

Even if you already know this, you may want different ways to interpret the expression depending on where we put the parentheses. I get that. It might be worth trying to build intuition for matrix associativity in a more general context than statistics, beyond the simple algebraic proof, but here I'll focus on the statistical interpretations.

For this to apply to covariance matrices and maximizing variance, $A$ has to be centered (each column has mean zero).

Let

$$ A = \begin{bmatrix} x_1 & y_1 & z_1 \\ x_2 & y_2 & z_2 \\ \vdots & \vdots & \vdots\\ x_n & y_n & z_n \end{bmatrix} $$

We have $n$ samples / data points / whatever, and each one is a 3-dimensional (row) vector. The matrix shape is n-by-3.

connection between variance and length

There are actually two lengths of interest here -- the length of our data points (their distance from the origin) in the 3-dimensional row world, and the "length" of each variable in the n-dimensional column world.

Each sample (row), like $r_1 = \left[x_1\;y_1\;z_1\right]$, has a length (the point's distance from the origin):

$$\|r_1\| = \sqrt{\left<r_1, r_1\right>} = \sqrt{x_1^2 + y_1^2 + z_1^2}$$

At the same time, the column vector corresponding to just one of our variables (one dimension of our samples)

$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} $$

has a length. First, consider the squared length:

$$X^TX = x_1^2 + \dots + x_n^2.$$

This should look reminiscent of $Var(X)$. Recall,

$Var(X) = E(X^2) - E(X)^2.$

Since $A$ is centered (the columns have sum/mean zero), $E(X)^2 = 0$, so $Var(X) = E(X^2)$, which we estimate by the sample average $\frac{1}{n}\sum_i x_i^2$. The only thing missing from $X^TX$ is dividing by $n$ to get a mean instead of a sum. So

$$Var(X) = \frac{X^TX}{n} \propto X^TX = \|X\|^2$$

and, for fun,

$$stdev(X) = \frac{\|X\|}{\sqrt{n}}$$

So we can basically ignore normalizing by $n$ and more or less picture $Var(X)$ as the squared length of the vector $X$. Maximizing one is equivalent to maximizing the other since $n$ is constant. Remember, this only works because $A$ is centered.
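A minimal numpy sketch of this, with toy data (nothing special about the shapes or values):

```python
import numpy as np

# After centering, a column's variance is its squared length divided by n,
# and its standard deviation is its length divided by sqrt(n).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
A = A - A.mean(axis=0)            # center: every column now has mean zero

X = A[:, 0]                       # one variable (one column)
n = len(X)
print(np.isclose(X @ X / n, np.var(X)))                        # Var(X) = ||X||^2 / n
print(np.isclose(np.linalg.norm(X) / np.sqrt(n), np.std(X)))   # stdev(X) = ||X|| / sqrt(n)
```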

Let's introduce a unit vector $u$. Using the basic definition of matrix multiplication, and recalling that we've named the $i$th row of $A$ (the $i$th data point) $r_i$,

$$Au = \begin{bmatrix} \left<r_1, u\right>\\ \vdots \\ \left<r_n, u\right>\\ \end{bmatrix} = \begin{bmatrix} p_1 \\ \vdots \\ p_n \end{bmatrix} = P $$

Let's look at the lengths in the two worlds again.

$Au$ is an n-by-1 matrix. We've also named this result $P$. The rows are 1-dimensional. Keep in mind the difference between the projection of one vector onto another, which is a vector in the original 3-d space, and the (signed) length / magnitude of that projection, which is a scalar, or perhaps a 1-d vector. When we say "project the rows of $A$ onto $u$", it can be ambiguous which we mean. For now, we're interested in the latter. Since $u$ is unit-length, $\left<r_i, u\right>$ is the (signed) length of the projection of $r_i$ onto $u$. It's 1-dimensional. Think of this as a compressed version of our data. Instead of a 3-d scatter plot (the $r$'s), we now have a 1-d scatter plot (the $p$'s).
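To make that concrete, here's a small numpy sketch (made-up data again) checking that $Au$ really is the stack of row-wise inner products $\left<r_i, u\right>$:

```python
import numpy as np

# A @ u stacks the signed projection lengths <r_i, u>: the 1-d "compressed"
# coordinate of each row along the unit vector u.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
A = A - A.mean(axis=0)

u = np.array([1.0, 2.0, 2.0])
u = u / np.linalg.norm(u)                  # make u unit length

P = A @ u                                  # shape (100,)
row_by_row = np.array([row @ u for row in A])
print(np.allclose(P, row_by_row))          # True
```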

Since they're 1-dimensional, computing row lengths is trivial: $\|p_i\| = \sqrt{\left<p_i, p_i\right>} = \sqrt{p_i^2} = |p_i|$

Hopefully it's not too confusing that I'm kind of using $p_i$ as both a 1-d vector and as a straight-up scalar real number there. Whereas before we had

$$r_i = \left[x_i\;y_i\;z_i\right]$$ we now are doing $$p_i = \left[p_i\right]$$

which isn't great but maybe not worth the trouble to avoid.

In the same way that we considered the $p$'s to be our new data points, we can consider $P$ to be a new, single random variable, in place of $X$, $Y$, and $Z$.

Like before (since $P$ again has zero mean, being a linear combination of the zero-mean $X$, $Y$, and $Z$),

$$Var(P) \propto P^TP = p_1^2 + \dots + p_n^2$$

What's interesting here is that last expression. Since $P$ is a vector whose entries are the $p_i$, which are (trivially, up to sign) the lengths of our new 1-d data points, $P^TP$ is the sum of the squared lengths of our 1-d data points. And so $Var(P)$ is basically (disregarding having to divide by $n$) the sum of the squared lengths of our data points. More strictly, $Var(P)$ is the average of these squared lengths (sum, then divide by $n$). To be super explicit,

$$n\,Var(P) = \|P\|^2 = P^TP = p_1^2 + \dots + p_n^2 = \|p_1\|^2 + \dots + \|p_n\|^2$$

So that's the connection between variance and lengths. Variance is conceptually very close to the squared length of our column, which in the univariate case (1-dimensional data points) turns out to be the sum of the squared lengths of our data points. Maximizing one or the other is equivalent.
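Here's that connection as a numpy sketch (toy data, nothing special about the numbers):

```python
import numpy as np

# n * Var(P) equals ||P||^2, the sum of the squared 1-d projection lengths.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
A = A - A.mean(axis=0)
u = np.array([1.0, 2.0, 2.0])
u = u / np.linalg.norm(u)

P = A @ u
n = len(P)
print(np.isclose(n * np.var(P), P @ P))        # n * Var(P) == ||P||^2
print(np.isclose(P @ P, np.sum(P ** 2)))       # ||P||^2 == sum of squared lengths
```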

interpreting different sub-expressions of $u^TA^TAu$

$Au$ produces $P$, which is a set of 1-dimensional data points determined by the projection of the original points onto $u$.

Then, $A^T(Au)$ will be a 3-by-1 list of inner products of $P$ with each of $X$, $Y$, and $Z$.

$$A^T(Au) = A^TP = \begin{bmatrix} X^TP \\ Y^TP \\ Z^TP \end{bmatrix} $$

In the same way that $Var(X) \propto X^TX$, try to work out for yourself that $Cov(X, P) \propto X^TP$ (when $X$ and $P$ have zero mean, which they do here).
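If you'd like to see that covariance claim numerically, here's a small sketch (toy data; `bias=True` makes numpy divide by $n$ like we do here):

```python
import numpy as np

# The entries of A^T (A u) are, up to a factor of 1/n, the covariances of the
# projected variable P with each original column X, Y, Z.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
A = A - A.mean(axis=0)
u = np.array([1.0, 2.0, 2.0])
u = u / np.linalg.norm(u)

P = A @ u
n = len(P)
covs_via_matrix = A.T @ P / n                  # [Cov(X,P), Cov(Y,P), Cov(Z,P)]
covs_via_numpy = np.array([np.cov(A[:, j], P, bias=True)[0, 1] for j in range(3)])
print(np.allclose(covs_via_matrix, covs_via_numpy))   # True
```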

Lastly,

$$u^T(A^T(Au)) = \begin{bmatrix}u_x & u_y & u_z\end{bmatrix} \begin{bmatrix} X^TP \\ Y^TP \\ Z^TP \end{bmatrix} = u_x (X^TP) + u_y (Y^TP) + u_z (Z^TP) $$

You could try to think of this as projecting that vector of covariances of $P$ with the original 3 variables ($A^TAu$) onto $u$. I don't think I've ever heard anyone explicitly mention this interpretation, even though it's pretty straightforward to arrive at, if not deeply grasp. But ultimately,

$$ u_x (X^TP) + u_y (Y^TP) + u_z (Z^TP) = (u_xX + u_yY + u_zZ)^TP = P^TP = (Au)^T(Au) $$

and that last expression is how most people interpret all of this -- the variance (not yet divided by $n$) of your new random variable $P$, which is its squared Euclidean length since $P$ is centered.
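To tie it back to the original maximization: the unit vector $u$ that maximizes this quantity is the top eigenvector of $A^TA$ (equivalently, of the covariance matrix $A^TA/n$). Here's a final sketch on toy data, comparing it against many random unit vectors:

```python
import numpy as np

# The unit vector maximizing u^T (A^T A) u is the eigenvector of A^T A with
# the largest eigenvalue; no random unit vector beats its projection variance.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
A = A - A.mean(axis=0)

evals, evecs = np.linalg.eigh(A.T @ A)     # eigenvalues in ascending order
u_best = evecs[:, -1]                      # eigenvector for the largest eigenvalue
best_var = np.var(A @ u_best)              # variance of the projections onto u_best

U = rng.normal(size=(1000, 3))             # many random directions
U /= np.linalg.norm(U, axis=1, keepdims=True)
print(np.all(np.var(A @ U.T, axis=0) <= best_var + 1e-12))   # True
```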


MathFoliage
  • Thanks for the great explanation. I understood perfectly till 'You could try to think of this as projecting that vector of covariances of $P$ with the original 3 variables ($A^TAu$) onto $u$.' Why do we want to do this? This step is not making sense to me – Rahul Deora May 09 '20 at 04:03
  • @RahulDeora, we left-multiply $(A^TAu)$ by $u^T$ because we have to add in that term to get the full $u^TA^TAu$. The comment about "thinking of this as projecting" is just an interpretation of this. It's not a particularly useful interpretation and I don't yet have a better one. – MathFoliage May 15 '20 at 22:04