I have two datasets:

  1. a $1 \times m$ matrix of "ideal" conditions for $m$ factors
  2. an $n \times m$ matrix of $n$ observations (rows) for each of the $m$ factors

For each observation in the second matrix, I would like to calculate how far it is from the "ideal" condition. The output would be $n$ values that represent "distances" from ideal conditions.

First question, is the Mahalanobis distance appropriate to use here? The $m$ factors are spatial in nature, and are related to each other.

Second question, how do I set this up in R? I have tried a few examples with mahalanobis(), mahalanobis.dist(), and pairwise.mahalanobis(), but I cannot see how these can be used with my example. When I try to use my matrices with these functions, I get an error:

Error in solve.default(cov, ...) : 
  Lapack routine dgesv: system is exactly singular: U[8,8] = 0

Which I have come to understand means that one of my matrices is singular and therefore cannot be inverted. I am not entirely sure how to get around this issue, or if it needs to be gotten around at all for my purposes.

I have also read this thread, but it's a bit over my head.

The overall goal of this is to use the results to map out "ideal" habitat ranges for a particular species.

Any help, thoughts, or suggestions would be greatly appreciated!

2 Answers

1

It sounds like a plausible context for the Mahalanobis distance. You need to be able to specify or estimate a covariance matrix, Sigma, for your m factors. It sounds like your n*m matrix is a sample of data. If it is a reasonable sample for estimating Sigma, and n > m with no factor constant or an exact linear combination of the others, so that the estimate of Sigma is invertible, you are in business. Example code below.

## make some fake data, akin to your n*m matrix, with n > m
library(MASS)
TrueSigma <- matrix(c(10,3,2,1,3,9,2,1,2,2,8,1,1,1,1,7),4,4)
Mu <- 4:1
FakeData <- mvrnorm(n = 5, Mu, TrueSigma)

## specify ideal means, akin to your 1*m matrix
IdealMu <- c(3,3,2,0)

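## calculate the squared Mahalanobis distance for row 3
SigmaInv <- solve( var(FakeData) )
(FakeData[3,]-IdealMu) %*% SigmaInv %*% (FakeData[3,]-IdealMu)

## same quantity for all n rows at once; take sqrt() for the distance itself
mahalanobis(FakeData, IdealMu, var(FakeData))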
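
For reference, the matrix product in the code above evaluates the usual squared Mahalanobis distance. The symbols here are my own notation, not from the question: $x_i$ is row $i$ of the data, $\mu^*$ is the ideal vector (IdealMu), and $S$ is the sample covariance (var(FakeData)).

$$D_i^2 = (x_i - \mu^*)^\top S^{-1} (x_i - \mu^*), \qquad D_i = \sqrt{D_i^2}$$

The product returns $D_i^2$, so take the square root if you want the distance on the original scale.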
0

What the error means is that the covariance matrix is singular: at least one of your variables has no variation, or is an exact linear combination of the others. My recommendation would be to use an if-statement.

If length(unique(df$value)) > 1, then compute the Mahalanobis distance; otherwise do something else.
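
A minimal sketch of that check, applied column by column. The helper name dist_from_ideal and the fake data are hypothetical, and dropping constant columns is only one possible choice for "something else":

```r
## Hypothetical helper (not from the thread): squared Mahalanobis distance of
## each row of X from the vector `ideal`, skipping columns that never vary.
dist_from_ideal <- function(X, ideal) {
  X <- as.matrix(X)
  ideal <- as.numeric(ideal)
  ## keep only columns with more than one distinct value
  varies <- apply(X, 2, function(col) length(unique(col)) > 1)
  if (!any(varies)) stop("no column varies; covariance cannot be estimated")
  Xv <- X[, varies, drop = FALSE]
  S <- var(Xv)
  ## exact linear dependence between columns also makes S singular
  if (qr(S)$rank < ncol(S)) {
    stop("covariance still singular: collinear columns or too few rows")
  }
  mahalanobis(Xv, center = ideal[varies], cov = S)
}

set.seed(1)
X <- cbind(matrix(rnorm(40), 10, 4), 5)   # fake 10 x 5 data; column 5 is constant
ideal <- c(0, 0, 0, 0, 5)                 # fake 1 x m "ideal" conditions
d2 <- dist_from_ideal(X, ideal)           # n squared distances
sqrt(d2)                                  # distances on the original scale
```

Note that a dropped column no longer contributes to the distance, so any gap between its constant value and the ideal value is ignored. Whether that is acceptable depends on the variable.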

Hope that makes sense.

BlackHat