4

I understand in theory why the Mahalanobis distance is a good measure for multivariate outlier detection. However, everything I tend to read warns against calculating the inverse/pseudoinverse of a covariance matrix, which is needed to compute the Mahalanobis distance.

So, if nobody wants to compute the inverse, what distance measure should be used?
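For concreteness, one standard way to sidestep forming the inverse explicitly is to factor the covariance matrix and solve a triangular system instead. A minimal NumPy sketch (toy data, and it assumes the covariance is well conditioned — the ill-conditioned case discussed in the comments needs the SVD/pseudoinverse route):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # toy data: 200 samples, 3 variables
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)

# Factor S = L L^T and solve L z = (x - mu)^T instead of forming S^{-1};
# then d^2 = ||z||^2 is the squared Mahalanobis distance, since
# (x-mu)^T S^{-1} (x-mu) = (L^{-1}(x-mu))^T (L^{-1}(x-mu)).
L = np.linalg.cholesky(S)
diff = X - mu
z = np.linalg.solve(L, diff.T)
d2 = (z ** 2).sum(axis=0)            # squared Mahalanobis distance per sample
```

This avoids the explicit inverse but, to be clear, it does not cure ill-conditioning — it just avoids one unnecessary source of numerical error.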

user603
  • 21,225
  • 3
  • 71
  • 135
Aly
  • 1,149
  • 2
  • 15
  • 24
  • "However, everything I tend to read warns against calculating the inverse/pseudoinverse of a covariance matrix" — can you add a reference? The answer to your question depends on the context (and more particularly on the number of dimensions $p$). – user603 Feb 27 '13 at 15:07
  • @user603 As I have just learnt, I have a 21-dimensional feature vector (its entries sum to 1) and am trying to construct a covariance matrix from ~70 samples. This produces an ill-conditioned covariance matrix, so the calculation of the inverse/pseudoinverse is highly sensitive to small numerical changes, making the Mahalanobis distance inappropriate for me. Are there other alternatives? – Aly Feb 27 '13 at 15:20
  • 1
    One immediate issue is that your data "(sums to 1)". Consider a 2-d case, where data is of the form (x, y) with y = 1-x. The variables are perfectly (negatively) correlated, hence of course the covariance matrix will be singular — it looks like [1, -1; -1, 1]. Basically you have a redundant variable (back to my example, if you know x, you definitely know y). Try dropping a single variable. – Cam.Davidson.Pilon Feb 27 '13 at 16:38
  • 1
    If the authors of your literature are honest, they will eat their own words when they encounter linear regression. – Cam.Davidson.Pilon Feb 27 '13 at 16:41
  • 1
    as Cam.Davidson.Pilon wrote, your problems are not caused by the Mahalanobis distance per se but because you are dealing with so-called compositional data. You have to first transform your data in a specific way. See the pointer to a short intro about this type of data in my answer to a related [question](http://stats.stackexchange.com/a/50430/603). By the way, one of the most popular such transformations basically amounts to doing what I recommended in an [answer](http://stats.stackexchange.com/a/49914/603) to one of your previous questions. – user603 Feb 27 '13 at 16:48
  • @user603 Just skimming the article on compositional data, it states that all elements of my vector must be strictly >0; this is not the case for me. All elements must sum to 1 (or the same constant k), but particular elements may be 0. Does the pseudo-Mahalanobis distance using SVD still apply? – Aly Feb 28 '13 at 12:21
  • the pseudo-Mahalanobis distance still applies and you will end up with $p^* \leq p-1$ (but of course the other transformation models based on log-ratios will no longer make sense). – user603 Feb 28 '13 at 12:27
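To illustrate the pseudo-Mahalanobis idea from the comments above, here is a rough NumPy sketch for rank-deficient compositional data. The shapes mirror the ones mentioned (~70 samples, 21 parts summing to 1), but the data are made up; the key step is inverting the covariance only within the subspace where it has non-negligible singular values (i.e. using the Moore–Penrose pseudoinverse):

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.random(size=(70, 21))
X = raw / raw.sum(axis=1, keepdims=True)   # rows sum to 1 -> covariance has rank <= 20

mu = X.mean(axis=0)
diff = X - mu
S = np.cov(X, rowvar=False)

# SVD of the (symmetric) covariance; keep only directions whose singular
# values are above a numerical-rank tolerance, and invert in that subspace.
U, s, Vt = np.linalg.svd(S)
tol = s.max() * max(S.shape) * np.finfo(float).eps
keep = s > tol                             # effective dimension p* <= p-1
S_pinv = (U[:, keep] / s[keep]) @ U[:, keep].T
d2 = np.einsum('ij,jk,ik->i', diff, S_pinv, diff)
```

The `keep` mask is exactly the $p^* \leq p-1$ from the comment above: the sum-to-one constraint removes (at least) one dimension, and the pseudoinverse simply ignores it instead of dividing by a near-zero singular value.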

1 Answer

2

One of the issues is that the variables can be on very different scales. Suppose that two of your variables are income and gender, the former in dollars and the latter as a 0-1 indicator variable. Which is further away: being off by 1 unit in income or by 1 unit in gender? Being just a dollar away in income is pretty good, but having the wrong gender is as far away as you can get. You need to put these distances on a common scale; standard Euclidean distance doesn't do this. The variance-covariance matrix in the Mahalanobis distance rescales the variables to make their distances comparable.
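The rescaling point can be made concrete with a tiny sketch (made-up numbers; this shows only plain per-variable standardization, the diagonal part of what the covariance matrix does — the full Mahalanobis distance additionally accounts for correlations between variables):

```python
import numpy as np

# Made-up data: income in dollars, gender as a 0/1 indicator.
X = np.array([[50_000.0, 0.0],
              [50_001.0, 0.0],   # one dollar away from the first row
              [30_000.0, 1.0],
              [80_000.0, 1.0],
              [60_000.0, 0.0]])

# Plain Euclidean distance treats the units as interchangeable:
# a $1 gap and a full gender flip both contribute "1" to the distance.
raw = np.linalg.norm(X[1] - X[0])        # exactly 1.0

# Rescale each column by its standard deviation so that "1" means
# "one standard deviation" in every variable:
sd = X.std(axis=0, ddof=1)
Z = X / sd
scaled = np.linalg.norm(Z[1] - Z[0])     # far below 1: $1 is a tiny move
```

After rescaling, a one-dollar income difference correctly registers as negligible, while a gender flip remains a large move.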

Charlie
  • 13,124
  • 5
  • 38
  • 68
  • If my variables all have the same scale, i.e. my vector is a histogram, how can I use the Euclidean distance to the mean of my samples in a hypothesis test to determine outliers? – Aly Feb 27 '13 at 16:24
  • 1
    `The variance-covariance matrix rescales the variables` sounds strange, - it is not the correlation matrix. – ttnphns Feb 27 '13 at 16:50
  • If you divide each distance by the standard deviation of the respective variable, you get a measure of the distance in standard deviations. This gives a unitless measure that can be compared across observations. – Charlie Feb 27 '13 at 17:24
  • A distance is in multivariable space. Which of the variables is "respective" to it? – ttnphns Feb 27 '13 at 18:07