
I am trying to devise an appropriate measure of the 'distance to the normal mean' in high-dimensional space. I ran into a strange result, and I need some theoretical background to understand it.

The scenario: I have measurements of $p$ correlated variables (from the curves describing the movement of a body segment at each time instant during a particular movement) for $n$ individuals considered 'normal', and I would like to build an 'abnormality metric' based on this normal population to evaluate how far new subjects ('patients') are from the 'normal pattern'.

Problem: $n \ll p$ and the $p$ variables are correlated.

I thought of using PCA to decorrelate the variables and reduce the number of dimensions (retain the first $d$ components according to some criterion). But when I calculate the Euclidean distance to the mean for each of the $n = 35$ observations using the standardised coordinates (i.e. the Mahalanobis distance), I get exactly the same value for every observation.

I suspect this is one of the consequences of $n \ll p$ in the original data, since this is clearly not the case when $n > p$ (as displayed here, for example). I also checked that I do not get the same distance when I project a new observation onto the standardised components and calculate the Euclidean distance.
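For concreteness, here is a minimal sketch that reproduces the behaviour with random data (a hypothetical reproduction, assuming MATLAB's `pca` from the Statistics and Machine Learning Toolbox with its defaults; the sizes are arbitrary):

```matlab
n = 35; p = 500;              % n << p; any data in general position behaves the same
X = randn(n, p);              % random 'normal population'
[~, score, latent] = pca(X);  % scores on the n-1 components with nonzero variance
Z = score ./ sqrt(latent');   % standardise each component to unit variance (R2016b+)
d = sqrt(sum(Z.^2, 2));       % Euclidean norm of each row = Mahalanobis distance to the mean
disp(d.')                     % every entry is identical: (n-1)/sqrt(n) = 34/sqrt(35) ~ 5.747
```

Every entry of `d` comes out the same even though the data are pure noise, which is exactly the behaviour described above.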

My questions are:

  • Is there a theoretical result about this behavior (e.g. can I predict the value of the distance for all points from the covariance structure, $n$ and $p$)?
  • Should I be worried about my 'distance to the normal mean' metric?

Thanks,

M435
  • There is a theorem: it requires that, after centering, the $n$ points generate a space of $n-1$ dimensions. When that's the case, my description of the Mahalanobis distance easily implies all $n$ points must lie on a common $(n-2)$-sphere, namely their circumsphere, and then (just as obviously) that sphere's center must be the origin. – whuber Apr 04 '19 at 15:27
  • Great, thanks, I had an intuition of this. Q1: Where can I find this theorem explained/described? Q2: Can I predict the radius of the $(n-2)$-sphere the points lie on? – M435 Apr 05 '19 at 07:42
  • Just would like to add that in my experiment the root mean squares are all equal to 0.9856, i.e. close to 1... – M435 Apr 05 '19 at 12:41
  • Do you actually have 35 data points rather than 30? ($\sqrt{34/35}=0.9856\ldots$) You might consider computing the covariance matrix of the data rather than the covariance *estimator*: the computation of the former divides by $n$ and the latter divides by $n-1.$ – whuber Apr 05 '19 at 13:38
  • Yes! It was 35 data points; I have edited the original question. So, does it mean the sphere's (hyper)radius is meant to be a function of the number of observations? – M435 Apr 05 '19 at 14:05
  • No, it just means you ought to choose the appropriate formula for the covariance matrix. Because you are comparing your data among themselves, using the Bessel correction of $n/(n-1)$ in the covariance formula is unnecessary--you're not trying to construct an unbiased estimator of anything--and it only leads to this (slight) dependence of the radius on $n$. – whuber Apr 05 '19 at 14:08
  • Ok, but that was not my point. I was not (so far) controlling the calculation of the covariance matrix since Matlab's pca function does this calculation for me. – M435 Apr 05 '19 at 14:12
  • [tried to move to chat but not enough 'reputation' to do so] Sorry, I was referring to the norm of the points being dependent on the number of observations, $n$, as the square root of $n$ times 1. But I forgot to mention that I was considering the norm as the radius of the (hyper)sphere. Just realised that the root mean square is probably a better way to 'define' the radius of a (hyper)sphere. I am not used to thinking in high-dimensional space. – M435 Apr 05 '19 at 14:24
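For reference, here is a sketch of the calculation behind the comments above (my own reconstruction; the notation is mine). Let $X$ be the $n \times p$ centred data matrix, $S = X^\top X/(n-1)$ the sample covariance, and $S^+$ its pseudo-inverse. The squared Mahalanobis distance of the $i$-th row $x_i$ is

$$d_i^2 = x_i^\top S^+ x_i = (n-1)\,\bigl[X (X^\top X)^+ X^\top\bigr]_{ii} = (n-1)\,H_{ii},$$

where $H$ is the orthogonal projector onto the column space of $X$. Centring makes every column of $X$ orthogonal to the all-ones vector $\mathbf{1}$, and when the points span $n-1$ dimensions that column space is exactly $\mathbf{1}^\perp$, so $H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$ and $H_{ii} = (n-1)/n$ for every $i$. Hence the common radius is

$$d_i = \frac{n-1}{\sqrt{n}} \quad \text{for all } i,$$

and the root mean square over the $n-1$ standardised coordinates is $d_i/\sqrt{n-1} = \sqrt{(n-1)/n}$, which for $n = 35$ is $\sqrt{34/35} = 0.9856\ldots$, the value quoted in the comments. Dividing the covariance by $n$ instead of $n-1$ rescales this root mean square to exactly $1$.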

0 Answers