20

I am using $k$-means clustering to cluster speaker voices. When I compare an utterance with the clustered speaker data, I get a (Euclidean distance-based) average distortion. This distance lies in the range $[0,\infty)$. I want to convert this distance to a similarity score in $[0,1]$. Please guide me on how I can achieve this.

Sycorax
  • 76,417
  • 20
  • 189
  • 313
Muhammad
  • 331
  • 1
  • 2
  • 5

5 Answers

23

If $d(p_1,p_2)$ represents the Euclidean distance from point $p_1$ to point $p_2$,

$$\frac{1}{1 + d(p_1, p_2)}$$

is commonly used.
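
A minimal sketch of this transform in Python with NumPy (the function name and test points are illustrative, not from the answer):

```python
import numpy as np

def inverse_distance_similarity(p1, p2):
    """Map a Euclidean distance in [0, inf) to a similarity in (0, 1]."""
    d = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    return 1.0 / (1.0 + d)

# Identical points give similarity 1; similarity decays toward 0 as d grows.
print(inverse_distance_similarity([0, 0], [0, 0]))  # 1.0
print(inverse_distance_similarity([0, 0], [3, 4]))  # 1 / (1 + 5) ≈ 0.167
```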

TrynnaDoStat
  • 7,414
  • 3
  • 23
  • 39
  • Please correct me if I am wrong: if we have $X = (x_1,x_2,x_3,\dots,x_t)$ and $Y = (y_1,y_2,y_3,\dots,y_n)$, where each $x$ and $y$ is of dimension $D$, then we can define the similarity as $$\text{Similarity} = \frac{1}{t} \sum\limits_{i=1}^t \frac{1}{1 + \text{minDistance}(x_i, Y)}.$$ – Muhammad Jun 23 '15 at 22:27
  • 1
    I understand that the plus 1 in the denominator is there to avoid a divide-by-zero error. But I have found that it disproportionately affects $d(p_1,p_2)$ values greater than 1 and ultimately reduces the similarity score significantly. Is there another way to do this? Maybe $s = 1-d(p_1,p_2)$? – aamir23 Aug 08 '18 at 02:13
13

You could also use: $\frac{1}{e^{\mathrm{dist}}}$, where $\mathrm{dist}$ is your desired distance function.
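
A quick sketch of the same idea, assuming NumPy and a Euclidean distance (illustrative names only):

```python
import numpy as np

def exp_similarity(p1, p2):
    """Similarity 1 / e^d = exp(-d), which lies in (0, 1]."""
    d = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    return np.exp(-d)

print(exp_similarity([0, 0], [0, 0]))  # 1.0
print(exp_similarity([0, 0], [3, 4]))  # exp(-5) ≈ 0.0067
```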

Danica
  • 21,852
  • 1
  • 59
  • 115
NeuroMorphing
  • 525
  • 2
  • 12
  • Can you please give a reference book or documentation for this equation, i.e., where you found it? @Dougal – Justlife Jun 11 '17 at 12:55
  • @AnimeshKumarPaul I didn't write this answer, just improved its formatting. But it's frequently used as a version of e.g. a "generalized RBF kernel"; see e.g. [here](https://stats.stackexchange.com/q/64126/9964). That question concerns whether the output is a positive definite kernel; if you don't care about that, though, it at least satisfies an intuitive notion of similarity that more distant points are less similar. – Danica Jun 11 '17 at 13:02
  • @Justlife : Google for this one "encyclopedia of distances" and pick the result with the pdf document. – NeuroMorphing Jun 12 '17 at 23:04
  • Which is the same as $\mathrm{sim} = e^{-\mathrm{dist}}$. – user3352632 Feb 16 '21 at 20:19
10

It sounds like you want something akin to cosine similarity, a similarity score that lies in $[-1,1]$ (and in $[0,1]$ when the vectors have non-negative entries). In fact, a direct relationship between Euclidean distance and cosine similarity exists!

Observe that $$ ||x-x^\prime||^2=(x-x^\prime)^T(x-x^\prime)=||x||^2+||x^\prime||^2-2x^T x^\prime. $$

While cosine similarity is $$ f(x,x^\prime)=\frac{x^T x^\prime}{||x||||x^\prime||}=\cos(\theta) $$ where $\theta$ is the angle between $x$ and $x^\prime$.

When $||x||=||x^\prime||=1,$ we have $$ ||x-x^\prime||^2=2(1-f(x,x^\prime)) $$ and $$ f(x,x^\prime)=x^T x^\prime, $$

so

$$ 1-\frac{||x-x^\prime||^2}{2}=f(x,x^\prime)=\cos(\theta) $$ in this special case.

From a computational perspective, it may be more efficient to compute the cosine directly, rather than computing the Euclidean distance and then applying the transformation.
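
A short numerical check of the identity for unit vectors, assuming NumPy (the helper name is illustrative):

```python
import numpy as np

def cosine_similarity(x, xp):
    """Cosine of the angle between x and x'."""
    return x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp))

rng = np.random.default_rng(0)
x = rng.normal(size=5);  x /= np.linalg.norm(x)    # normalize so ||x|| = 1
xp = rng.normal(size=5); xp /= np.linalg.norm(xp)  # normalize so ||x'|| = 1

# For unit vectors, 1 - ||x - x'||^2 / 2 equals the cosine similarity.
lhs = 1 - np.linalg.norm(x - xp) ** 2 / 2
print(np.isclose(lhs, cosine_similarity(x, xp)))  # True
```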

Sycorax
  • 76,417
  • 20
  • 189
  • 313
8

How about a Gaussian kernel?

$K(x, x') = \exp\left( -\frac{\| x - x' \|^2}{2\sigma^2} \right)$

The distance $\|x - x'\|$ is used in the exponent. The kernel value is in the range $(0, 1]$. There is one tuning parameter, $\sigma$. Basically, if $\sigma$ is high, $K(x, x')$ will be close to 1 for any $x, x'$; if $\sigma$ is low, even a slight distance from $x$ to $x'$ will push $K(x,x')$ close to 0.
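
A brief sketch, assuming NumPy, that also illustrates the effect of $\sigma$ (the function name is illustrative):

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """Similarity exp(-||x - x'||^2 / (2 sigma^2)), which lies in (0, 1]."""
    d2 = np.sum((np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Larger sigma flattens the kernel toward 1; smaller sigma pushes it toward 0.
for sigma in (0.5, 1.0, 5.0):
    print(sigma, gaussian_kernel([0, 0], [1, 1], sigma=sigma))
```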

wij
  • 1,893
  • 11
  • 18
  • 1
    Note that this answer and [@Unhandled exception's](http://stats.stackexchange.com/a/158303/9964) are very related: this is $\exp\left( - \gamma d(x, x')^2 \right)$, where that one [introducing a scaling factor] is $\exp\left( - \gamma d(x, x') \right)$, a Gaussian kernel with $\sqrt{d}$ as the metric. This will [still be a valid kernel](http://stats.stackexchange.com/a/148031/9964), though the OP doesn't necessarily care about that. – Danica Jun 23 '15 at 16:08
1

If you are using a distance metric that is naturally between 0 and 1, like the Hellinger distance, then you can use $1 - \text{distance}$ to obtain a similarity.
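
A small sketch, assuming NumPy and discrete probability distributions as inputs; it uses the standard form $H(p,q)=\frac{1}{\sqrt{2}}\|\sqrt{p}-\sqrt{q}\|_2$, which lies in $[0,1]$ (the function name is illustrative):

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions; always in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0)

p = [0.2, 0.5, 0.3]
q = [0.1, 0.4, 0.5]
print(1.0 - hellinger_distance(p, q))  # similarity in [0, 1]
```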

Brad
  • 522
  • 4
  • 10