I am currently reading "The Secret Life of the Covariance Matrix" (http://www.inf.fu-berlin.de/inst/ag-ki/rojas_home/documents/tutorials/secretcovariance.pdf) and am confused by the following passage:
Now, in the case of multidimensional variables, we need something similar to the mean squared distance to the mean. If a point $x$ is to be classified, we can measure how similar it is to a point $x_i$ (in the cluster centered around $\mu$) by computing the square of the scalar product of the vectors relative to the center of the cluster $\mu$:
$d\left(x, x_{i}\right)=\left((x-\mu)^{\mathrm{T}}\left(x_{i}-\mu\right)\right)^{2}$
We can repeat this computation for each data point $x_i$, $i = 1, \dots, N$, and average the results: $\begin{aligned} D(x, \mu) &= \frac{1}{N} \sum_{i=1}^{N}\left((x-\mu)^{\mathrm{T}}\left(x_{i}-\mu\right)\right)^{2} \\ &= \frac{1}{N} \sum_{i=1}^{N}(x-\mu)^{\mathrm{T}}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{\mathrm{T}}(x-\mu) \\ &= (x-\mu)^{\mathrm{T}}\left(\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{\mathrm{T}}\right)(x-\mu) \\ &= (x-\mu)^{\mathrm{T}} \Sigma (x-\mu), \end{aligned}$
where $\Sigma=\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{\mathrm{T}}$.
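To convince myself that the algebra above holds, I checked it numerically. This is only a sketch with synthetic data; the rows of `X` play the role of the $x_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 3
X = rng.normal(size=(N, d))     # synthetic data: N points x_i as rows
mu = X.mean(axis=0)             # cluster center
x = rng.normal(size=d)          # a point to classify

# Left-hand side: average of the squared scalar products
lhs = np.mean([((x - mu) @ (xi - mu)) ** 2 for xi in X])

# Right-hand side: quadratic form with the covariance matrix
Sigma = (X - mu).T @ (X - mu) / N
rhs = (x - mu) @ Sigma @ (x - mu)

print(np.isclose(lhs, rhs))     # True: the two expressions agree
```

So the derivation itself checks out numerically; my confusion is about the two points below.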
Firstly, why is the distance measured as above? Shouldn't it be the squared Euclidean ($L_2$) distance, defined as $d_{L 2}(x, y)=(x-y)^{\mathrm{T}}(x-y)$?
Secondly, if $A$ is my mean-centered data matrix, then the covariance matrix is defined as $A^{\mathrm{T}}A$; above, however, we have $AA^{\mathrm{T}}$ as $\Sigma$. That is the covariance among the data points, while we want the covariance among the features. Shouldn't it be the opposite?
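To make the shapes concrete, here is a small NumPy check, assuming the rows of `A` are the centered samples (this assumption may be exactly where I am going wrong):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 500, 3
A = rng.normal(size=(N, d))        # assumed convention: rows = samples, columns = features
A = A - A.mean(axis=0)             # mean-center each feature

B = A.T                            # columns = samples, as the paper writes the x_i

cov_rows = A.T @ A / N             # the A^T A convention (rows as samples)
cov_cols = B @ B.T / N             # the A A^T convention (columns as samples)

# the paper's sum of outer products (x_i - mu)(x_i - mu)^T
Sigma = sum(np.outer(xi, xi) for xi in A) / N

print(np.allclose(cov_rows, cov_cols))  # True: both are the d x d feature covariance
print(np.allclose(Sigma, cov_rows))     # True: matches the paper's Sigma
```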
I am asking because this is later used to show that the variance of the one-dimensional projection is $\sigma^{2}=\frac{1}{N} \sum_{i=1}^{N} u^{\mathrm{T}}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{\mathrm{T}} u=u^{\mathrm{T}} \Sigma u$, which I am keen to understand.
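The same kind of numerical sketch reproduces that identity as well, for an arbitrary unit vector `u` and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 500, 3
X = rng.normal(size=(N, d))
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / N

u = rng.normal(size=d)
u /= np.linalg.norm(u)              # unit direction to project onto

proj = (X - mu) @ u                 # one-dimensional projections u^T (x_i - mu)
var_proj = np.mean(proj ** 2)       # their variance (the projections have zero mean)

print(np.isclose(var_proj, u @ Sigma @ u))  # True
```

So the identities all hold numerically; what I am missing is the intuition for the scalar-product similarity measure and for the $A^{\mathrm{T}}A$ versus $AA^{\mathrm{T}}$ convention.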