
I'm currently working through the book Collective Intelligence (by Toby Segaran) and came across the Euclidean distance score. In the book the author shows how to calculate the similarity between two recommendation arrays (i.e. $\textrm{person} \times \textrm{movie} \mapsto \textrm{score}$).

He calculates the Euclidean distance between two persons $p_1$ and $p_2$ as $$d(p_1, p_2) = \sqrt{\sum_{i\,\in\,\textrm{items}} (s_{p_1,i} - s_{p_2,i})^2}$$ where $s_{p,i}$ is person $p$'s score for item $i$.

This makes complete sense to me. What I don't really understand is why, at the end, he calculates the following to get a "distance based similarity":

$$ \frac{1}{1 + d(p_1, p_2)} $$

So I somehow get that this must be the conversion from a distance to a similarity (right?). But why does the formula look like this? Can someone explain that?
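In Python, the calculation I mean looks roughly like this (my own minimal sketch; the `ratings` structure and names are mine, not necessarily the book's exact code):

```python
from math import sqrt

# Hypothetical person -> {movie: score} ratings (illustrative values)
ratings = {
    "p1": {"Movie A": 4.5, "Movie B": 1.0, "Movie C": 3.0},
    "p2": {"Movie A": 4.0, "Movie B": 2.0, "Movie C": 5.0},
}

def euclidean_distance(prefs, p1, p2):
    """d(p1, p2) over the items both persons have scored."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    return sqrt(sum((prefs[p1][item] - prefs[p2][item]) ** 2 for item in shared))

def distance_similarity(prefs, p1, p2):
    """The conversion I am asking about: 1 / (1 + d)."""
    return 1 / (1 + euclidean_distance(prefs, p1, p2))

print(distance_similarity(ratings, "p1", "p2"))  # ~0.30
```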

navige
  • There can be many ways to convert dissimilarities and similarities into each other; the specific formula depends on what makes sense to you and on the analysis to follow. In that textbook the author preferred the formula you show for some reason; someone else in a different situation might choose another formula. The most _geometrically correct_ way to convert _euclidean_ distance into a similarity would follow from the [cosine theorem](http://stats.stackexchange.com/a/36158/3277) under the data-are-centered condition and is described [here](http://stats.stackexchange.com/a/12503/3277) in par. 1. – ttnphns Mar 23 '13 at 12:51
  • Ok! But if I understand right, you don't really convert the euclidean distance into a similarity; you just use a different function that returns values within 0 and 1 (because of the cosine), right? I mean, it seems different to me from calculating all the distances and then converting them to a similarity by e.g. interpolating between the smallest and the largest distance. Right? – navige Mar 23 '13 at 13:26
  • If you have a square symmetric matrix of squared euclidean distances and you perform the "double centering" operation on it, then you get the matrix of the scalar products which would be observed when you put the origin of the euclidean space at the centre of your configuration of objects. These scalar products _are_ angle-type similarities. They are much like _covariances_. They are not bound within the range 0-1, they can be negative or positive, and diagonal elements are not necessarily 1. Still, they are similarities. – ttnphns Mar 23 '13 at 13:55
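For concreteness, a minimal NumPy sketch (my own, not from the thread) of the double-centering operation ttnphns describes above:

```python
import numpy as np

# Three objects described by two coordinates each (illustrative values)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Square symmetric matrix of squared Euclidean distances
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# Double centering: B = -1/2 * J @ D2 @ J, where J = I - (1/n) * 11'
n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J

# B holds the scalar products of the configuration centred at its mean:
# angle-type similarities that can be negative, with diagonal != 1.
Xc = X - X.mean(axis=0)
print(np.allclose(B, Xc @ Xc.T))  # True
```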

4 Answers


The inverse is what changes the distance into a similarity.

The 1 in the denominator is there so that the maximum value is 1 (attained when the distance is 0).

The square root - I am not sure about. If the distance is usually larger than 1, the root will make large distances less important; if the distance is less than 1, it will make large distances more important.
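As a quick numeric illustration (values are mine, not from the book): the square root compresses large distances and inflates small ones before the inversion.

```python
# Compare 1/(1+d) with 1/(1+sqrt(d)) for a few distances
for d in [0.0, 0.25, 1.0, 4.0, 100.0]:
    print(f"d={d:6.2f}  1/(1+d)={1/(1+d):.4f}  1/(1+sqrt(d))={1/(1+d**0.5):.4f}")
```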

Peter Flom
  • Sorry! The square root was wrong: the author actually put it in the second formula, but left it out in the first, so it shouldn't be there. – navige Mar 23 '13 at 12:20
  • Yes, but your hint about setting the maximum value to 1 makes sense! Thanks! – navige Mar 23 '13 at 12:21

To measure distance and similarity (in the semantic sense), the first thing to check is whether you are moving in a Euclidean space or not. An empirical way to verify this is to estimate the distance for a pair of values whose meaning you already know.
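One way to read that check, as a small sketch with hypothetical ratings (names and numbers are mine, purely illustrative): a pair of users you know to be alike should come out closer than a pair you know to disagree.

```python
import numpy as np

alice = np.array([5.0, 4.0, 1.0])   # ratings for three movies
bob   = np.array([4.5, 4.0, 1.5])   # known to share Alice's tastes
carol = np.array([1.0, 2.0, 5.0])   # known to have opposite tastes

print(np.linalg.norm(alice - bob))    # ~0.71: small, as expected
print(np.linalg.norm(alice - carol))  # 6.0: large, as expected
```

If the computed distances contradict what you already know about the pairs, a plain Euclidean distance on these coordinates is probably not the right dissimilarity for your data.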

gung - Reinstate Monica

As you mentioned, you already know how the Euclidean distance is calculated, so I will explain the second formula. The Euclidean formula measures distance, which is smaller for people or items that are more similar: if they are identical the distance is 0, and the more they differ the larger it gets.

However, we need a function that gives a higher value for more similar items. This can be done by adding 1 to the distance (so you don't get a division-by-zero error and the maximum value is 1) and inverting it: if the distance is 0, the similarity score is 1/1 = 1.

Say the Euclidean distance between item 1 and item 2 is 4, and between item 1 and item 3 is 0 (meaning they are 100% similar). These are the distances between items in a virtual space: the smaller the distance, the nearer the items are to each other, and the more likely they are to be similar. Now we want a numerical value that is higher when items are more similar, so we invert the distance. But what if the distance is 0? That is why we add 1 in the denominator. So the similarity score for items 1 and 2 is 1/(1+4) = 0.2, and for items 1 and 3 it is 1/(1+0) = 1.
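The same arithmetic as a tiny Python check (the function name is my own):

```python
def similarity(distance):
    # 1/(1 + d): the +1 avoids division by zero and caps the score at 1
    return 1.0 / (1.0 + distance)

print(similarity(4))  # item 1 vs item 2 -> 0.2
print(similarity(0))  # item 1 vs item 3 -> 1.0
```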

Jay Patel
  • I don't understand this answer. – Michael R. Chernick Aug 06 '18 at 05:05
  • OK, say the Euclidean distance between item 1 and item 2 is 4, and between item 1 and item 3 is 0 (meaning they are 100% similar). These are the distances between items in a virtual space: the smaller the distance, the nearer the items, and the more likely they are to be similar. Now we want a numerical value that is higher when items are more similar, so we invert the distance. But what if the distance is 0? That is why we add 1 in the denominator. So the similarity score for items 1 and 2 is 1/(1+4) = 0.2, and for items 1 and 3 it is 1/(1+0) = 1. – Jay Patel Aug 08 '18 at 10:33
  • Maybe you are talking about some sort of distance measure, but Euclidean distance follows a specific formula defined on a vector space. – Michael R. Chernick Aug 08 '18 at 18:36
  • I am explaining why we calculate, at the end, the following to get a "distance based similarity": $1/(1 + d(p_1, p_2))$. – Jay Patel Aug 11 '18 at 04:46

Euclidean distance basically measures the dissimilarity of two vectors: it returns 0 when the two vectors are identical, whereas cosine similarity returns 1 for identical vectors. It seems the writer of that book wants a similarity-based measure but still wants to use Euclidean distance. So, in order to get a similarity out of the distance, he flipped the formula and added 1 to the denominator, so that it gives 1 when two vectors are identical. Go give it a check: try it with two vectors containing the same values.
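A quick check of that, assuming NumPy (the vectors are mine, purely illustrative): for two identical vectors, both the Euclidean-based score and cosine similarity come out as 1.

```python
import numpy as np

a = np.array([2.0, 3.0, 4.0])
b = a.copy()  # two vectors containing the same values

d = np.linalg.norm(a - b)          # Euclidean distance: 0.0
euclid_sim = 1.0 / (1.0 + d)       # flipped-and-shifted formula -> 1.0
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # -> 1.0

print(euclid_sim, cos_sim)  # both report full similarity: 1.0 1.0
```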