I am using $k$-means clustering to cluster speaker voices. When I compare an utterance with the clustered speaker data, I get an average distortion based on Euclidean distance. This distance can lie in the range $[0,\infty)$. I want to convert it to a similarity score in $[0,1]$. Please guide me on how I can achieve this.
5 Answers
If $d(p_1,p_2)$ represents the Euclidean distance from point $p_1$ to point $p_2$,
$$\frac{1}{1 + d(p_1, p_2)}$$
is commonly used.
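For instance, a minimal sketch of this transformation in Python (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def similarity(p1, p2):
    """Map Euclidean distance to a (0, 1] similarity score via 1 / (1 + d)."""
    d = np.linalg.norm(np.asarray(p1) - np.asarray(p2))
    return 1.0 / (1.0 + d)

print(similarity([0, 0], [0, 0]))  # 1.0 (identical points)
print(similarity([0, 0], [3, 4]))  # 1 / (1 + 5) ≈ 0.167
```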

- Please correct me if I am wrong: if we have $X = (x_1,x_2,x_3,\ldots,x_t)$ and $Y = (y_1,y_2,y_3,\ldots,y_n)$, where each $x$ and $y$ is of dimension $D$, then we can define the similarity as $$\text{Similarity} = \frac{1}{t} \sum\limits_{i=1}^t \frac{1}{1 + \text{minDistance}(x_i, Y)}$$ (a sketch of this appears after the comments). – Muhammad Jun 23 '15 at 22:27
- I understand that the $+1$ in the denominator is there to avoid a divide-by-zero error, but I have found that it disproportionately affects $d(p_1,p_2)$ values greater than 1 and ultimately reduces the similarity score significantly. Is there another way to do this? Maybe $s = 1-d(p_1,p_2)$? – aamir23 Aug 08 '18 at 02:13
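A minimal sketch of the per-frame averaging proposed in the first comment above (Python with NumPy assumed; `min_distance` and the toy data are illustrative):

```python
import numpy as np

def min_distance(x_i, Y):
    """Smallest Euclidean distance from frame x_i to any vector in Y."""
    return np.min(np.linalg.norm(Y - x_i, axis=1))

def average_similarity(X, Y):
    """Mean of 1 / (1 + minDistance(x_i, Y)) over all frames x_i in X."""
    return np.mean([1.0 / (1.0 + min_distance(x_i, Y)) for x_i in X])

X = np.array([[0.0, 0.0], [1.0, 1.0]])   # utterance frames
Y = np.array([[0.0, 0.0], [5.0, 5.0]])   # cluster centroids
print(average_similarity(X, Y))  # (1/(1+0) + 1/(1+sqrt(2))) / 2 ≈ 0.707
```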
You could also use $\frac{1}{e^{\text{dist}}}$, where $\text{dist}$ is your desired distance function.
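A minimal sketch (Python with NumPy assumed; `dist` defaults to Euclidean distance here, but any distance function works):

```python
import numpy as np

def exp_similarity(x, xp, dist=lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))):
    """Map a nonnegative distance to (0, 1] via 1 / e^dist = exp(-dist)."""
    return np.exp(-dist(x, xp))

print(exp_similarity([0, 0], [0, 0]))  # 1.0
print(exp_similarity([0, 0], [3, 4]))  # exp(-5) ≈ 0.0067
```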

- Can you give a reference (a book or documentation) for this equation, where you found it? @Dougal – Justlife Jun 11 '17 at 12:55
- @AnimeshKumarPaul I didn't write this answer, just improved its formatting. But it's frequently used as a version of e.g. a "generalized RBF kernel"; see e.g. [here](https://stats.stackexchange.com/q/64126/9964). That question concerns whether the output is a positive definite kernel; if you don't care about that, though, it at least satisfies an intuitive notion of similarity that more distant points are less similar. – Danica Jun 11 '17 at 13:02
- @Justlife: Google for "encyclopedia of distances" and pick the result with the PDF document. – NeuroMorphing Jun 12 '17 at 23:04
It sounds like you want something akin to cosine similarity, which is itself a bounded similarity score (in $[-1,1]$ in general, and in the unit interval for nonnegative data). In fact, a direct relationship between Euclidean distance and cosine similarity exists!
Observe that $$ ||x-x^\prime||^2=(x-x^\prime)^T(x-x^\prime)=||x||^2+||x^\prime||^2-2x^T x^\prime, $$
while the cosine similarity is $$ f(x,x^\prime)=\frac{x^T x^\prime}{||x||\,||x^\prime||}=\cos(\theta), $$ where $\theta$ is the angle between $x$ and $x^\prime$.
When $||x||=||x^\prime||=1,$ we have $$ ||x-x^\prime||^2=2(1-f(x,x^\prime)) $$ and $$ f(x,x^\prime)=x^T x^\prime, $$
so
$$ 1-\frac{||x-x^\prime||^2}{2}=f(x,x^\prime)=\cos(\theta) $$ in this special case.
From a computational perspective, it may be more efficient to just compute the cosine, rather than Euclidean distance and then perform the transformation.
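A quick numerical check of this identity, as a minimal sketch in Python (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors, normalized to unit length.
x = rng.normal(size=5)
xp = rng.normal(size=5)
x /= np.linalg.norm(x)
xp /= np.linalg.norm(xp)

cos_sim = x @ xp                                # cosine similarity of unit vectors
from_dist = 1 - np.linalg.norm(x - xp)**2 / 2   # transform of squared Euclidean distance

print(np.isclose(cos_sim, from_dist))  # True
```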

How about a Gaussian kernel?
$$K(x, x') = \exp\left( -\frac{\| x - x' \|^2}{2\sigma^2} \right)$$
The distance $\|x - x'\|$ is used in the exponent, and the kernel value lies in $(0, 1]$. There is one tuning parameter, $\sigma$: if $\sigma$ is large, $K(x, x')$ will be close to 1 for any $x, x'$; if $\sigma$ is small, even a slight distance from $x$ to $x'$ will push $K(x,x')$ close to 0.
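A minimal sketch (Python with NumPy assumed; `sigma` is the tuning parameter described above):

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - xp||^2 / (2 * sigma^2)), in (0, 1]."""
    sq_dist = np.sum((np.asarray(x) - np.asarray(xp))**2)
    return np.exp(-sq_dist / (2.0 * sigma**2))

x, xp = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(gaussian_kernel(x, xp, sigma=0.5))  # small sigma: ≈ 0.018, near 0
print(gaussian_kernel(x, xp, sigma=5.0))  # large sigma: ≈ 0.96, near 1
```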

- Note that this answer and [@Unhandled exception's](http://stats.stackexchange.com/a/158303/9964) are very related: this is $\exp\left( - \gamma d(x, x')^2 \right)$, where that one [introducing a scaling factor] is $\exp\left( - \gamma d(x, x') \right)$, a Gaussian kernel with $\sqrt{d}$ as the metric. This will [still be a valid kernel](http://stats.stackexchange.com/a/148031/9964), though the OP doesn't necessarily care about that. – Danica Jun 23 '15 at 16:08
If you are using a distance metric that is naturally between 0 and 1, like the Hellinger distance, then you can use $1 - \text{distance}$ to obtain similarity.
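A minimal sketch (Python with NumPy assumed), using the Hellinger distance between two discrete probability distributions as the bounded metric:

```python
import numpy as np

def hellinger_similarity(p, q):
    """Similarity = 1 - Hellinger distance; both lie in [0, 1] for distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    dist = np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)
    return 1.0 - dist

print(hellinger_similarity([0.5, 0.5], [0.5, 0.5]))  # 1.0 (identical distributions)
print(hellinger_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (disjoint support)
```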
