I am trying to come up with an appropriate measure of the 'distance to the normal mean' in high-dimensional space. I ran into a strange result, and I need some theoretical background to understand it.
The scenario is as follows: I have measurements of $p$ correlated variables (taken from curves describing the movement of a body segment at each time instant during a particular movement) for $n$ individuals considered 'normal'. I would like to devise an 'abnormality metric' based on this normal population, to evaluate the distance of new subjects ('patients') from the 'normal pattern'.
Problem: $n \ll p$, and the $p$ variables are correlated.
I thought of using PCA to decorrelate the variables and reduce the number of dimensions (retaining the first $d$ components according to some criterion). But when I compute the Euclidean distance to the mean for each of the $n=35$ observations using the standardised component scores (which is equivalent to the Mahalanobis distance), I get exactly the same value for every observation.
I suspect this is a consequence of $n \ll p$ in the original data, since it clearly does not happen when $n > p$ (for example displayed here). I also checked that I do not get this same distance when I project a *new* observation onto the standardised components and compute its Euclidean distance.
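For reference, here is a minimal sketch that reproduces the effect with purely synthetic data (i.i.d. Gaussian noise; the sizes $n=35$, $p=100$ are illustrative). With $n \ll p$ the centered data have rank $n-1$, and the Mahalanobis distance of every training point to its own sample mean comes out identical; empirically it coincides with $(n-1)/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 35, 100                      # n << p, as in my setting
X = rng.standard_normal((n, p))     # synthetic 'normal' population

Xc = X - X.mean(axis=0)             # center the data
# PCA via SVD; only n-1 components have nonzero variance after centering
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
d = n - 1                           # rank of the centered data matrix
scores = U[:, :d] * s[:d]           # PC scores of the training points
# standardise each component by its standard deviation sqrt(s_j^2 / (n-1))
std_scores = scores / (s[:d] / np.sqrt(n - 1))

# Euclidean distance on standardised scores = Mahalanobis distance
dist = np.linalg.norm(std_scores, axis=1)
print(dist)                         # identical for every observation
print((n - 1) / np.sqrt(n))         # matches the common value
```

For $n=35$ every training point lands at distance $34/\sqrt{35} \approx 5.75$, regardless of the covariance structure of the data.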
My questions are:
- Is there a theoretical result explaining this behavior (e.g. can I predict the common distance value for all points, given the covariance structure, $n$, and $p$)?
- Should I be worried about my 'distance to the normal mean' metric?
Thanks,