7

I'm trying to calculate a Mahalanobis-type pairwise distance matrix in R. I have 33 individuals, each with 10 variables. The idea is to get a distance matrix D, where

$$D_{i,j}=(\mathbf{X}_i-\mathbf{X}_j)W^{-1}(\mathbf{X}_i-\mathbf{X}_j)^T$$

However I haven't been able build proper code for it.

Javier
  • 81
  • 1
  • 1
  • 3
  • 12
    Type `?mahalanobis` in `R` and look the documentation. – MYaseen208 Jul 19 '12 at 21:56
  • 2
    @MYaseen208: Wrong. Look below. – user603 Aug 02 '12 at 08:32
  • 5
    See also: http://stats.stackexchange.com/questions/65705/pairwise-mahalanobis-distance – russellpierce Jul 27 '13 at 21:56
  • 1
    The problem with the `mahalanobis` function in `R` as recommended by @MYaseen208 is that this calculates maha distance between a single point and a set of points, not pairwise distance between every pair of points in a set of points. See the post recommended by @rpierce for more discussion. – ahfoss Jan 08 '14 at 16:18
  • Answered here: http://stackoverflow.com/questions/29608280/mahalanobis-distance-with-multiple-observations-per-group/29614330#29614330 – Andre Silva May 06 '15 at 19:44

4 Answers4

2

The following worked for me in similar example where R is a dataframe of 54 individuals and 8 variables. Mahalanobis distance Ma between individuals X1 and X2 can be computed as ff:

# express difference (X1-X2) as atomic row vector
d <- as.matrix(X1-X2)[1,] 

# solve  (covariance matrix) %*% x = d for x
x <- solve(cov(R),d)

# Mahalanobis calculation forced in two steps
Ma <- sum(d*x)
Horst Grünbusch
  • 5,020
  • 17
  • 22
Keith
  • 21
  • 3
0

There a very easy way to do it using R Package "biotools". In this case you will get a Squared Distance Mahalanobis Matrix.

#Manly (2004, p.65-66)

x1 <- c(131.37, 132.37, 134.47, 135.50, 136.17)
x2 <- c(133.60, 132.70, 133.80, 132.30, 130.33)
x3 <- c(99.17, 99.07, 96.03, 94.53, 93.50)
x4 <- c(50.53, 50.23, 50.57, 51.97, 51.37)

#size (n x p) #Means 
x <- cbind(x1, x2, x3, x4) 

#size (p x p) #Variances and Covariances
Cov <- matrix(c(21.112,0.038,0.078,2.01, 0.038,23.486,5.2,2.844, 
            0.078,5.2,24.18,1.134, 2.01,2.844,1.134,10.154), 4, 4)

library(biotools)
Mahalanobis_Distance<-D2.dist(x, Cov)
print(Mahalanobis_Distance)
Jalles10
  • 111
  • 4
0

You could try the gendistance function in the nbpMatching package

Here's a short example modified from the help page, with two variables instead of 10:

df <- data.frame(id=1:33, val1=rnorm(33), val2=rnorm(33))
df.dist <- gendistance(df, idcol=1)
df.dist$dist

The distance matrix will have a 34th row/column-- this is for use in matching, and you can ignore it.

  • thanks for the extra info. From what i understand of the OP's description the answer should be a vector with 33*32/2 positive numbers in it... – – user603 Jan 30 '13 at 16:12
-1

Here is the code to do it:

library("MASS")
library("ICSNP")

x0<-mvrnorm(33,1:10,diag(c(seq(1,1/2,l=10)),10))
x1<-pair.diff(x0) #C-implementation.
dM<-mahalanobis(x1,colMeans(x1),var(x1))

Following Roman Luštrik's suggestion, here are more details. The OP asked for pairwise Mahalanobis distance, which are multivariate U-statistics of distance. I have first seen them mentionned in Croux et al. 94 (below equation 6.4) but i'm sure others such as Oja have explored this concept.

user603
  • 21,225
  • 3
  • 71
  • 135
  • It would be helpful if you expanded your answer with at least a short comment on your code. Why do you use it that way, why this approach is better compared to another approach... – Roman Luštrik Aug 02 '12 at 09:13
  • It would be helpful if you expanded your comment with at least a short description of what that "another approach" could be. – user603 Aug 02 '12 at 09:26
  • I leave that at your discretion. My point was that it would help others (including me) if you gave your answer some more context. – Roman Luštrik Aug 02 '12 at 13:52
  • Here is a question from [@user21060](http://stats.stackexchange.com/users/21060/user21060) who cannot leave comment at the moment: Should `var(x1)` be `var(x0)` instead as the $\sigma$ should be the variance of the original data, and I also question whether the empirical mean should be used here? – chl May 31 '13 at 11:17
  • thanks to user21060 for the questions. The answers are no and yes. both should be for x1. I'm correcting the oversight now. – user603 May 31 '13 at 11:19
  • 1
    I have posted some theory giving the rationale behind the following code at http://stats.stackexchange.com/questions/65705/pairwise-mahalanobis-distance/66325#66325 fastPwMahal = function(x1,invCovMat) { SQRT = with(svd(invCovMat), u %*% diag(d^0.5) %*% t(v)) dist(x1 %*% SQRT) } At the link above I have also shown that the (currently top-voted) solution using the ICSNP package on this forum seems to be incorrect. I joined stack exchange to answer this question, but initially did not have enough reputation to do so. I'm hoping a mod can arrange for these threads to be merged. – ahfoss Aug 02 '13 at 05:00
  • Please don't read anything into an answer that gets only 1 upvote, especially in a thread that was migrated here. Migrations tend to go unnoticed and therefore may remain unchecked for a long time. – whuber Jan 08 '14 at 21:47
  • @ahfoss: you are confusing Mahalanobis distance (result of the mahalanobis() function in R) with *pairwise* Mahalanobis distance. The OP asked for the latter while the rest of the answers decided to interpret this to mean the former. In any case, these are two different estimators (except when the underlying data is symmetric about its mean). – user603 Apr 30 '19 at 05:54