I am currently reading "The Secret Life of the Covariance Matrix" (http://www.inf.fu-berlin.de/inst/ag-ki/rojas_home/documents/tutorials/secretcovariance.pdf) and am confused by the following passage:
Now, in the case of multidimensional variables, we need something similar to the mean squared distance to the mean. If a point $x$ is to be classified, we can measure how similar it is to a point $x_i$ (in the cluster centered around $\mu$) by computing the square of the scalar product of the vectors relative to the center of the cluster $\mu$:
$d\left(x, x_{i}\right)=\left((x-\mu)^{\mathrm{T}}\left(x_{i}-\mu\right)\right)^{2}$
We can repeat this computation for each data point $x_i$, $i = 1, \dots, N$, and average the results: $\begin{aligned} D(x, \mu) &= \frac{1}{N} \sum_{i=1}^{N}\left((x-\mu)^{\mathrm{T}}\left(x_{i}-\mu\right)\right)^{2} \\ &= \frac{1}{N} \sum_{i=1}^{N}(x-\mu)^{\mathrm{T}}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{\mathrm{T}}(x-\mu) \\ &= (x-\mu)^{\mathrm{T}}\left(\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{\mathrm{T}}\right)(x-\mu) \\ &= (x-\mu)^{\mathrm{T}} \Sigma (x-\mu), \end{aligned}$
where $\Sigma=\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{\mathrm{T}}$.
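To convince myself that the algebra above holds, I checked it numerically. This is only a sketch with synthetic data; the rows of `X` play the role of the $x_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 3
X = rng.normal(size=(N, d))     # synthetic data: N points x_i as rows
mu = X.mean(axis=0)             # cluster center
x = rng.normal(size=d)          # a point to classify

# Left-hand side: average of the squared scalar products
lhs = np.mean([((x - mu) @ (xi - mu)) ** 2 for xi in X])

# Right-hand side: quadratic form with the covariance matrix
Sigma = (X - mu).T @ (X - mu) / N
rhs = (x - mu) @ Sigma @ (x - mu)

print(np.isclose(lhs, rhs))     # True: the two expressions agree
```

So the derivation itself checks out numerically; my confusion is about the two points below.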
Firstly, why is the distance measured as above? Shouldn't it be the squared Euclidean ($L_2$) distance, defined as $d_{L 2}(x, y)=(x-y)^{\mathrm{T}}(x-y)$?
Secondly, if $A$ is my mean-centered data matrix, then the covariance matrix is defined as $A^{\mathrm{T}}A$; above, however, we have $AA^{\mathrm{T}}$ as $\Sigma$. That is the covariance among the data points, while we want the covariance among the features. Shouldn't it be the opposite?
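To make the shapes concrete, here is a small NumPy check, assuming the rows of `A` are the centered samples (this assumption may be exactly where I am going wrong):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 500, 3
A = rng.normal(size=(N, d))        # assumed convention: rows = samples, columns = features
A = A - A.mean(axis=0)             # mean-center each feature

B = A.T                            # columns = samples, as the paper writes the x_i

cov_rows = A.T @ A / N             # the A^T A convention (rows as samples)
cov_cols = B @ B.T / N             # the A A^T convention (columns as samples)

# the paper's sum of outer products (x_i - mu)(x_i - mu)^T
Sigma = sum(np.outer(xi, xi) for xi in A) / N

print(np.allclose(cov_rows, cov_cols))  # True: both are the d x d feature covariance
print(np.allclose(Sigma, cov_rows))     # True: matches the paper's Sigma
```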
I am asking because this is later used to show that the variance of the one-dimensional projection is $\sigma^{2}=\frac{1}{N} \sum_{i=1}^{N} u^{\mathrm{T}}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{\mathrm{T}} u=u^{\mathrm{T}} \Sigma u$, which I am keen to understand.
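The same kind of numerical sketch reproduces that identity as well, for an arbitrary unit vector `u` and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 500, 3
X = rng.normal(size=(N, d))
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / N

u = rng.normal(size=d)
u /= np.linalg.norm(u)              # unit direction to project onto

proj = (X - mu) @ u                 # one-dimensional projections u^T (x_i - mu)
var_proj = np.mean(proj ** 2)       # their variance (the projections have zero mean)

print(np.isclose(var_proj, u @ Sigma @ u))  # True
```

So the identities all hold numerically; what I am missing is the intuition for the scalar-product similarity measure and for the $A^{\mathrm{T}}A$ versus $AA^{\mathrm{T}}$ convention.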