Questions tagged [distance]

Measure of distance between distributions or variables, such as Euclidean distance between points in n-space.

Mathematically, a distance $d$, or metric, is a function that satisfies the following properties for all points $x, y, z$:

  1. $d(x,y) \geq 0$
  2. $d(x,y) = 0 \implies x = y$
  3. $d(x,y) = d(y,x)$
  4. $d(x,z) \leq d(x,y) + d(y,z)$

Note that certain notions of "distance" in probability theory do not satisfy all of these properties. In particular, the Kullback–Leibler (KL) divergence between two distributions is not symmetric, so it violates property 3 above, and it also fails the triangle inequality (property 4).
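The asymmetry is easy to check numerically; a minimal sketch in pure Python, using a hypothetical pair of two-point distributions:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(P || Q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

# D(P||Q) and D(Q||P) differ, so KL divergence fails the symmetry property.
d_pq = kl(p, q)  # ≈ 0.368
d_qp = kl(q, p)  # ≈ 0.511
```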

Euclidean distance, Manhattan distance and Hamming distance are all common metrics.
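Each of the three metrics named above fits in a line or two; a quick sketch with made-up points:

```python
import math

def euclidean(x, y):
    """Straight-line distance in n-space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions at which the two sequences differ."""
    return sum(a != b for a, b in zip(x, y))

x, y = (0, 0, 1), (3, 4, 1)
# euclidean(x, y) -> 5.0, manhattan(x, y) -> 7, hamming(x, y) -> 2
```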

672 questions
160 votes · 9 answers

Bottom to top explanation of the Mahalanobis distance?

I'm studying pattern recognition and statistics, and in almost every book I open on the subject I bump into the concept of the Mahalanobis distance. The books give sort of intuitive explanations, but still not good enough ones for me to actually really…
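For readers skimming the listing: the distance itself is just Euclidean distance after rescaling by the inverse covariance, $d_M(x) = \sqrt{(x-\mu)^\top \Sigma^{-1} (x-\mu)}$. A minimal 2-D sketch (pure Python, hypothetical numbers, not from the question):

```python
import math

def mahalanobis_2d(x, mu, cov):
    """sqrt((x - mu)^T Sigma^{-1} (x - mu)) for a symmetric 2x2 covariance."""
    (a, b), (_, c) = cov            # cov = [[a, b], [b, c]]
    det = a * c - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Apply the inverse covariance [[c, -b], [-b, a]] / det to the deviation.
    return math.sqrt((c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det)

# With the identity covariance, Mahalanobis reduces to Euclidean distance:
d = mahalanobis_2d((3.0, 4.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]])  # 5.0
```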
72 votes · 5 answers

Intuition on the Kullback–Leibler (KL) Divergence

I have learned that the intuition behind the KL divergence is that it measures how much a model distribution differs from the theoretical/true distribution of the data. The source I am reading goes on to say that the intuitive understanding of 'distance'…
cgo (7,445)
51 votes · 2 answers

Choosing the right linkage method for hierarchical clustering

I am performing hierarchical clustering on data I've gathered and processed from the reddit data dump on Google BigQuery. My process is the following: get the latest 1000 posts in /r/politics; gather all the comments; process the data and compute an…
33 votes · 4 answers

Maximum Mean Discrepancy (distance distribution)

I have two data sets (source and target data) which follow different distributions. I am using MMD, a non-parametric distribution distance, to measure the distance between the marginal distributions of the source and target data. source data, Xs target data,…
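MMD compares mean kernel embeddings; the biased sample estimate is $\widehat{\mathrm{MMD}}^2 = \frac{1}{m^2}\sum_{i,j} k(x_i,x_j) + \frac{1}{n^2}\sum_{i,j} k(y_i,y_j) - \frac{2}{mn}\sum_{i,j} k(x_i,y_j)$. A minimal 1-D sketch with an RBF kernel (illustrative data, not the asker's):

```python
import math

def rbf(u, v, gamma=1.0):
    """Gaussian (RBF) kernel k(u, v) = exp(-gamma * |u - v|^2) in 1-D."""
    return math.exp(-gamma * (u - v) ** 2)

def mmd2(xs, ys, gamma=1.0):
    """Biased squared-MMD estimate between two 1-D samples."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

# Identical samples give 0; well-separated samples give a large value.
```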
28 votes · 1 answer

Earth Mover's Distance (EMD) between two Gaussians

Is there a closed-form formula for (or some kind of bound on) the EMD between $x_1\sim N(\mu_1, \Sigma_1)$ and $x_2 \sim N(\mu_2, \Sigma_2)$?
ifog (381)
28 votes · 1 answer

Converting similarity matrix to (euclidean) distance matrix

In the random forest algorithm, Breiman (the author) constructs the similarity matrix as follows: send all learning examples down each tree in the forest; if two examples land in the same leaf, increment the corresponding element of the similarity matrix by 1; normalize…
Uros K (467)
27 votes · 3 answers

Distribution of difference between two normal distributions

I have two probability density functions of normal distributions: $$f_1(x_1 \; | \; \mu_1, \sigma_1) = \frac{1}{\sigma_1\sqrt{2\pi} } \; e^{ -\frac{(x-\mu_1)^2}{2\sigma_1^2} }$$ and $$f_2(x_2 \; | \; \mu_2, \sigma_2) = \frac{1}{\sigma_2\sqrt{2\pi} }…
Martijn (395)
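For independent $X_1$ and $X_2$, the answer is a standard result: the difference of independent normals is again normal,

```latex
X_1 - X_2 \sim \mathcal{N}\!\left(\mu_1 - \mu_2,\; \sigma_1^2 + \sigma_2^2\right)
```

(the means subtract, but the variances add). If $X_1$ and $X_2$ are correlated with correlation $\rho$, the variance becomes $\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2$.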
27 votes · 1 answer

Using correlation as distance metric (for hierarchical clustering)

I would like to hierarchically cluster my data, but rather than using Euclidean distance, I'd like to use correlation. Also, since the correlation coefficient ranges from -1 to 1, with both -1 and 1 denoting "co-regulation" in my study, I am…
Megatron (373)
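One common convention matching the asker's requirement (treating $r = 1$ and $r = -1$ as equally similar) is $d = 1 - |r|$; note this is a dissimilarity rather than a true metric, since the triangle inequality can fail. A pure-Python sketch:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corr_dist(x, y):
    """1 - |r|: perfect correlation AND perfect anti-correlation both give 0."""
    return 1.0 - abs(pearson(x, y))
```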
26 votes · 1 answer

Can the Mantel test be extended to asymmetric matrices?

The Mantel test is usually applied to symmetric distance/difference matrices. As far as I understand, an assumption of the test is that the measure used to define differences must be at least a semi-metric (meet the standard requirements of a metric…
Tom Seaton (361)
25 votes · 8 answers

Perform K-means (or its close kin) clustering with only a distance matrix, not points-by-features data

I want to perform K-means clustering on objects I have, but the objects aren't described as points in space, i.e., by an objects × features dataset. However, I am able to compute the distance between any two objects (it is based on a similarity…
mouse (253)
23 votes · 4 answers

Why are mixed data a problem for euclidean-based clustering algorithms?

Most classical clustering and dimensionality reduction algorithms (hierarchical clustering, principal component analysis, k-means, self-organizing maps...) are designed specifically for numeric data, and their input data are seen as points in a…
22 votes · 1 answer

Link between variance and pairwise distances within a variable

Please prove that if we have two variables (of equal sample size) $X$ and $Y$, and the variance of $X$ is greater than that of $Y$, then the sum of squared differences (i.e., squared Euclidean distances) between data points within $X$ is also greater than…
ttnphns (51,648)
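The claimed link comes down to the identity $\sum_{i<j}(x_i - x_j)^2 = n\sum_i (x_i - \bar{x})^2$, so the pairwise squared distances within a variable scale directly with its sum of squares (and hence its variance). A quick numerical check with made-up data:

```python
def sum_sq_pair_diffs(x):
    """Sum of (x_i - x_j)^2 over all unordered pairs i < j."""
    return sum((a - b) ** 2 for i, a in enumerate(x) for b in x[i + 1:])

def n_times_ss(x):
    """n times the sum of squared deviations from the mean."""
    m = sum(x) / len(x)
    return len(x) * sum((a - m) ** 2 for a in x)

# Both sides of the identity agree, e.g. for [1, 2, 3] both equal 6.
```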
21 votes · 4 answers

Comparing two histograms using Chi-Square distance

I want to compare two images of faces. I calculated their LBP histograms, so now I need to compare these two histograms and get something that tells me how similar they are (0 to 100%). There are many ways of solving this task, but…
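One common form of the chi-squared histogram distance (widely used with LBP histograms) is $\chi^2(H_1, H_2) = \frac{1}{2}\sum_i \frac{(h_{1i} - h_{2i})^2}{h_{1i} + h_{2i}}$; a minimal sketch, with a small epsilon to guard against empty bins:

```python
def chi2_distance(h1, h2, eps=1e-10):
    """0.5 * sum (a - b)^2 / (a + b); identical histograms give 0."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))
```

Normalizing both histograms to sum to 1 first makes the result comparable across images of different sizes.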
20 votes · 9 answers

Pairwise Mahalanobis distances

I need to calculate the sample Mahalanobis distance in R between every pair of observations in a $n \times p$ matrix of covariates. I need a solution that is efficient, i.e. only $n(n-1)/2$ distances are calculated, and preferably implemented in…
ahfoss (1,289)
20 votes · 5 answers

How can I convert (Euclidean) distance to a similarity score

I am using $k$-means clustering to cluster speaker voices. When I compare an utterance with clustered speaker data I get a (Euclidean distance-based) average distortion. This distance lies in the range $[0,\infty)$. I want to convert this distance to…
Muhammad (331)
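Two common conventions for turning an unbounded distance into a $(0, 1]$ similarity are $s = 1/(1+d)$ and $s = e^{-d/\text{scale}}$; which is appropriate depends on how fast similarity should decay. A minimal sketch (the `scale` parameter is an illustrative knob, not from the question):

```python
import math

def sim_inverse(d):
    """Maps [0, inf) onto (0, 1]; d = 0 gives similarity 1."""
    return 1.0 / (1.0 + d)

def sim_exp(d, scale=1.0):
    """Exponential decay; larger `scale` makes similarity fall off more slowly."""
    return math.exp(-d / scale)
```

Both are monotone decreasing in the distance, so rankings by distance and by similarity agree.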