Understanding the R stats mahalanobis() function's Output

Question

An acquaintance recommended I use the Mahalanobis distance on my data instead of Euclidean, Manhattan, etc.

I tried using the mahalanobis() function in the R stats package on a data matrix with N samples and p features, with the p features as rows and N samples as columns.

>> cov_d = cov(t(data_mx))
>> mah = mahalanobis(x = t(data_mx), center = FALSE, cov=cov_d)

When I executed the lines above, I ran into an issue that someone else posted about previously: a "computationally singular" error (i.e., the result of calling solve() on a singular matrix), as discussed here: https://stackoverflow.com/questions/22134398/mahalonobis-distance-in-r-error-system-is-computationally-singular

When I set tol=1e-25 instead, as is recommended by one user in the post, I only get a vector back, not a patient x patient distance matrix like I expected to get.

>> cov_d = cov(t(data_mx))
>> mah = mahalanobis(x = t(data_mx), center = FALSE, cov=cov_d, tol=1e-25)
>> mah      
 PT001       PT002       PT003       PT001       PT002 
 -3.776784e+16 -3.776784e+16 -3.776784e+16 -3.776784e+16 -3.776784e+16
....
PT054         PT059        PT099        PT121        PT154 
-3.776784e+16 -3.776784e+16 -3.776784e+16 -3.776784e+16 -3.776784e+16

I'm looking for the Mahalanobis distance to give me an N x N matrix back. Will this distance metric only ever return a vector, not a matrix? How can you use a vector of distances? How do I know how the patients compare to one another if I don't have pairwise distances?

score 2 · Answer 1 · answered Dec 11 '19 at 07:45

The Mahalanobis distance is a single real number that measures the distance of a vector from a stipulated center point, based on a stipulated covariance matrix. The only time you get a vector or matrix of numbers is when you take a vector or matrix of these distances.

From the documentation for the mahalanobis function, you can see that the function "[r]eturns the squared Mahalanobis distance of all rows in x and the vector mu = center with respect to Sigma = cov." Since your input to the function is x = t(data_mx) (i.e., the transpose of data_mx), this means that you are getting one number for each column of the matrix data_mx; each number is the squared Mahalanobis distance for that column of data. (Presumably your data has 154 columns in it.)
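To make the shape of the output concrete, here is a minimal sketch on made-up toy data (the matrix x, its size, and the seed are illustrative assumptions, not the asker's data). It shows that mahalanobis() returns one squared distance per row, and that the result matches the textbook formula (x_i - mu)' S^-1 (x_i - mu). Note that center = FALSE, as used in the question, skips centering, so the distances there are measured from the origin rather than from the sample mean.

```r
# Toy data (made up for illustration): 6 samples (rows) x 2 features (columns)
set.seed(1)
x  <- matrix(rnorm(12), nrow = 6, ncol = 2)
mu <- colMeans(x)   # center: the sample mean of each feature
S  <- cov(x)        # 2 x 2 sample covariance of the features

# One squared distance per row of x: a vector of length 6, not a 6 x 6 matrix
d2 <- mahalanobis(x, center = mu, cov = S)

# The same quantity computed from the definition
xc        <- sweep(x, 2, mu)                  # subtract the center from every row
d2_manual <- rowSums((xc %*% solve(S)) * xc)  # row-wise quadratic form

all.equal(d2, d2_manual)
sqrt(d2)  # distances on the unsquared scale
```

If you want distances rather than squared distances, take sqrt() of the result, as in the last line.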

It is not clear from your question why you are expecting a matrix of Mahalanobis distances. To obtain this, you would need to compute the distance for a matrix of vectors, which would be an unusual computation (and would probably require you to program it in a loop, rather than relying on the base function). Alternatively, it is possible that you may be confusing the Mahalanobis distance with the hat matrix (these are loosely related), but it is impossible to say, since you do not tell us why you are interested in the distances.
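For what it's worth, if a sample-by-sample matrix is what is wanted, one possible sketch is below. It uses made-up toy data (x, the sample labels S1..S10, and the sizes are illustrative assumptions) and requires the covariance matrix to be invertible, so it will not work as written when there are more features than samples, which is the usual cause of the "computationally singular" error. Option A loops over samples, using each one in turn as the center; Option B whitens the data with a Cholesky factor and then takes ordinary Euclidean distances, which should give the same matrix.

```r
# Toy data (made up for illustration): 10 samples (rows) x 4 features (columns)
set.seed(1)
x <- matrix(rnorm(40), nrow = 10, ncol = 4)
rownames(x) <- paste0("S", 1:10)
S <- cov(x)   # must be non-singular, i.e. more samples than features

# Option A: loop, using each sample in turn as the "center"
n  <- nrow(x)
D2 <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
for (i in seq_len(n)) {
  D2[i, ] <- mahalanobis(x, center = x[i, ], cov = S)  # squared distances to sample i
}
D_loop <- sqrt(D2)

# Option B: whiten the data, then take Euclidean distances
R <- chol(S)          # upper-triangular factor with S = t(R) %*% R
z <- x %*% solve(R)   # whitened data: cov(z) is the identity
D_white <- as.matrix(dist(z))

all.equal(D_loop, D_white, check.attributes = FALSE)
```

Either result is a symmetric N x N matrix with zeros on the diagonal; as.dist(D_loop) converts it into an object that functions such as hclust() accept.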

Basically, if the Mahalanobis distance doesn't give pairwise patient distances, it is not useful as a comparison distance metric for my purposes. Because it gives a single real-number distance from some abstract center point, it doesn't tell me how patients relate to each other (i.e., the similarity between each pair of patients). It just tells me how each patient relates to this abstract center point. - lrthistlethwaite, Dec 11 '19 at 18:53

score 0 · Accepted Answer · answered Dec 11 '19 at 07:05

It seems you are looking for pairwise.mahalanobis (from the HDMD package), yet you are using mahalanobis.

The mahalanobis function returns the squared distance of each row to the center, so its result is a vector. This is described in the documentation I linked.

Nikolas Rieble
