3

I am not sure if this question makes sense or is even ideal, but here is my problem: Many proposed distance formulas only give a distance in terms of 1-dimensions. A common distance formula is the Euclidean distance (the most obvious one). Euclidean distance $d_{ecdn}$ of two points located at $(x_1, x_2, x_3, ... , x_n)$ and $(y_1, y_2, y_3, ... , y_n)$, respectively, is given by

$d_{ecdn} = \sqrt{\displaystyle\sum_{i=1}^{n} (x_i-y_i)^2}$.

Geometrically, in any number of dimensions (but imagine 3 dimensions as an example), if you were to draw a line between the two points, whose length expresses the Euclidean distance, the line itself exists in 1 dimension. Why is this a problem? I am doing a project involve data mining and information retrieval. Basically, each document in a corpus is mapped to a location in an $n$-dimensional space where $n$ is the total number of terms within the whole set of documents. They say that if document A is geometrically closer to document B than document C, then document A is more similar to B than C. Euclidean distance gives the wrong impression of similarity since each dimension matters.

I am wondering if anyone has any good algorithms that will calculate the "distance" between two points, but with a weight on each dimensions. An example would be the Pearson Product-Moment Correlation Coefficient. This formula is given by

pearson

Sidd Singal
  • 161
  • 1
  • 6
  • Note, lines are in 1 dimension. –  Aug 22 '12 at 05:52
  • 5
    I think you are looking for [Mahalanobis Distance](http://en.wikipedia.org/wiki/Mahalanobis_distance). – TenaliRaman Aug 22 '12 at 06:27
  • 1
    1) Weighted version of euclidean distance is straightforward, see the [formula](http://stats.stackexchange.com/questions/15289/when-to-use-weighted-euclidean-distance-and-how-to-determine-the-weights-to-use). 2) Why do you think Pearson r incorporates weights? I don't see any weighting. – ttnphns Aug 22 '12 at 07:34
  • @TenaliRaman +1. That probably would have been appropriate enough for an answer, but regardless, thank you for that! – Sidd Singal Aug 22 '12 at 13:30
  • @ttnphns Wow, I didn't know there was a weighted Euclidean distance formula. I figured Euclidean distance was just a single formula, so thank you for that too! I wasn't really too positive about the Pearson, but it seemed like even though it wasn't directly weighting each dimension, there was still accountability for each dimension. – Sidd Singal Aug 22 '12 at 13:43

1 Answers1

1

Just to create an official answer for this topic (based on @TenaliRaman's comment), you want the Mahalanobis distance.

This is computed similarly to the Euclidian distance, (in fact, the Euclidian distance formula is a special case of the Mahalanobis distance formula) except each squared difference term is 'corrected' based on an inverted covariance matrix.

From the Wikipedia page:

 Mahalanobis distance is thus unitless and scale-invariant, 
 and takes into account the correlations of the data set.
KevinL
  • 111
  • 2