6

People despise using Euclidean distance in higher-dimensional spaces because, they argue, it stops being a viable metric: the distance between two vectors becomes very large as the number of dimensions increases.

But to me that makes sense: as we add more dimensions, the Euclidean distance between two points in a high-dimensional space naturally becomes larger. I don't quite see the problem here.

How does inflating the distance between two vectors invalidate the metric itself? The magnitudes of the distances may be larger, but why can't it still be a viable comparison metric?

The book *The Elements of Statistical Learning* gives a nice picture trying to describe why Euclidean distance fails:

[figure from *The Elements of Statistical Learning* illustrating the curse of dimensionality]

Okay - so what? The distance between two points gets larger and larger? Why does that invalidate the metric? It makes perfect sense that the distance between these two vectors is larger as we add new dimensions because they are more dissimilar to one another in the newer dimensions.

Let's think of an example where we collected the lengths of a bunch of toys:

  • 5.3
  • 2.2
  • 1.2

The nearest point to a new toy of length 4.2 is 5.3 based on Euclidean distance. Now let's add another dimension, the width of these toys:

  • <5.3, 5.6>
  • <2.2, 2.1>
  • <1.2, 0.4>

And our new point is <4.2, 0>. Now the nearest point is <2.2, 2.1>. This makes sense, because the second dimension is wildly different for the other points. People argue that distances become less meaningful, but I can still successfully apply the metric here and the resulting distance makes perfect sense to me.
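
For what it's worth, here is a minimal sketch of that nearest-neighbour lookup (Python/NumPy; purely my own illustration of the numbers above):

```python
import numpy as np

# Toy lengths (1D) and the corresponding <length, width> pairs (2D) from above
toys_1d = np.array([[5.3], [2.2], [1.2]])
toys_2d = np.array([[5.3, 5.6], [2.2, 2.1], [1.2, 0.4]])

def nearest(points, query):
    """Return the training point closest to the query in Euclidean distance, and that distance."""
    dists = np.linalg.norm(points - query, axis=1)
    return points[np.argmin(dists)], dists.min()

print(nearest(toys_1d, np.array([4.2])))       # nearest: [5.3], distance ~ 1.1
print(nearest(toys_2d, np.array([4.2, 0.0])))  # nearest: [2.2, 2.1], distance ~ 2.9
```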

Anyway I don't fully understand this hatred towards Euclidean distance - it seems to make perfect sense to me!

  • "Revere"?? Surely you mean a different word. – whuber Mar 07 '16 at 17:48
  • lol - yeah, I just changed that. I guess I never actually understood what that word meant until now –  Mar 07 '16 at 17:51
  • I don't think "curse of dimensionality" is quite the relevant term here, and you have it in the title. – Richard Hardy Mar 07 '16 at 19:16
  • hm, how so? As we increase the dimensionality, the Euclidean metric will fail - that's the phenomenon of the curse. –  Mar 07 '16 at 19:28
  • "Curse of dimensionality" is a fixed expression that is used in specific settings, and I am not sure that this one fits in. – Richard Hardy Mar 07 '16 at 20:05
  • Hm - I think it does. As we increase the dimensions or features we collect and we use a nearest-neighbours predictor, the nearest neighbour will become incorrect because the similarity metric being used fails to work in high dimensions. I don't think curse of dimensionality is a fixed term; in fact, I think it is the opposite: cases where methods fail to work on high-dimensional data. –  Mar 07 '16 at 20:13
  • That's fine with me if that is so; I only have limited experience. – Richard Hardy Mar 07 '16 at 20:27
  • Related: http://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions/99191#99191 – Sycorax Mar 07 '16 at 20:33
  • I read that article, but it doesn't make so much sense to me. It makes the same argument –  Mar 07 '16 at 20:34
  • wow - what noob downvotes questions with true underlying beauty. this is an outrage –  Mar 09 '16 at 18:16

2 Answers

4

I am used to an essentially equivalent but, in my opinion, a bit more illustrative example.

Let $x_1,\dots,x_l$ be i.i.d. and uniformly distributed in the unit $n$-ball centered at the origin. Then it can be shown (see the proof in the EDIT below) that the median of the minimum of the Euclidean distances of these points from the origin, $m=\operatorname{med}\min\big(\rho(x_1,0),\dots,\rho(x_l,0)\big)$, is $$ m=\left[1-2^{-1/l}\right]^{1/n}. $$ Obviously, $m\to 1$ as $n\to\infty$.

Now, for some intuition about the curse of dimensionality, imagine that we want to classify the point at the origin using a $k$-NN classifier (for simplicity, even with $k=1$). What this formula tells us is that once the dimensionality of the feature space becomes large enough, even the nearest training point typically lies almost on the boundary of the unit ball. In other words, the points of our training sample will "almost surely" (not exactly in the measure-theoretic sense) lie almost on the boundary and thus will have almost the same Euclidean distance from our point, rendering comparisons of distances to the point of interest effectively useless.

This is how I like to think about the catchphrase "in a high-dimensional space, almost all points are almost equally distant from each other". Hope this intuition satisfies you.
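
If you want to see this numerically, here is a quick Monte Carlo sketch (Python/NumPy; the sampling scheme - a Gaussian direction scaled by a $U^{1/n}$ radius - is just one standard way to draw uniform points from the ball, and the sample sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit_ball(l, n):
    """Draw l points uniformly from the unit n-ball: uniform direction, radius ~ U^(1/n)."""
    x = rng.normal(size=(l, n))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x * rng.uniform(size=(l, 1)) ** (1.0 / n)

l = 100  # size of the "training sample"
for n in (2, 10, 100, 1000):
    d = np.linalg.norm(sample_unit_ball(l, n), axis=1)  # distances from the origin
    print(f"n={n:4d}  nearest={d.min():.3f}  farthest={d.max():.3f}  spread={d.max() - d.min():.3f}")

# As n grows, the gap between the nearest and the farthest sample point shrinks towards 0,
# so a 1-NN decision at the origin rests on vanishingly small differences in distance.
```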

EDIT Proof of the formula:

1) Let $r(x)=\rho(x, 0)$. Then the distribution function of $r$ is given by $$ F_r(t)=P(\rho(x, 0)<t)=\frac{V_n(t)}{V_n(1)}=t^n, $$ where $V_n(t)$ is the volume of an $n$-dimensional ball of radius $t$.

2) Let $M(X)=\min (r(x_1),\dots,r(x_l))$. Then the distribution of $M$ is $$ F_M(t)=P(M<t)=1-P(M\geq{t})=1-(1-F_r(t))^l=1-(1-t^n)^l. $$

3) Now, the definition of $m$ is $F_M(m)=1/2$, i.e. $1-(1-m^n)^l=1/2$. Solving for $m$ gives $m=\left[1-2^{-1/l}\right]^{1/n}$, as claimed.
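
A quick Monte Carlo check of this closed-form expression (Python/NumPy again; the values of `l`, `n`, and `trials` are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_unit_ball(l, n):
    # Uniform points in the unit n-ball: uniform direction times a radius distributed as U^(1/n)
    x = rng.normal(size=(l, n))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x * rng.uniform(size=(l, 1)) ** (1.0 / n)

def median_min_distance(l, n, trials=2000):
    """Monte Carlo estimate of med(min_i ||x_i||) for l uniform points in the unit n-ball."""
    minima = [np.linalg.norm(sample_unit_ball(l, n), axis=1).min() for _ in range(trials)]
    return np.median(minima)

l = 20
for n in (2, 5, 20, 100):
    closed_form = (1 - 2 ** (-1 / l)) ** (1 / n)
    print(f"n={n:3d}  simulated={median_min_distance(l, n):.4f}  formula={closed_form:.4f}")
```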

Vossler
  • Interesting. If the distance converges to some value, how does it work in 2D and 3D? –  Mar 07 '16 at 20:11
  • Are you saying that in an infinite-dimensional space all distances would be exactly the same? –  Mar 07 '16 at 20:15
  • Please add a link to the proof as well. That'd be awesommme –  Mar 07 '16 at 20:16
  • Almost: you don't have to consider an actual infinite-dimensional space but just a value of $n$ large enough. This is essentially what the book tells you: the larger the dimensionality, the worse two points can be told apart using Euclidean distance. – Vossler Mar 07 '16 at 20:18
  • Wow - well explained. `the larger the dimensionality, the worse two points can be told apart using Euclidean distance` I like that a lot –  Mar 07 '16 at 20:20
  • Why compute the median and not the average? –  Mar 07 '16 at 20:21
  • See my edits for the details and some small corrections. Median - because it's just extremely easy to calculate here :) Still does the job, though; no deep considerations. It doesn't really matter here - you can do this for the mean and/or any quantile other than the median. – Vossler Mar 07 '16 at 20:29
  • Very nice explanation! Did you come up with this yourself or is it a well known example? –  Mar 07 '16 at 20:33
  • It was a problem for the machine learning module at my university, which was not too long ago for me :) – Vossler Mar 07 '16 at 21:01
2

There is one respect in which the Euclidean distance is uncomfortable because it tends to increase with dimension: comparing distances between two pairs of points when the dimension of the first pair differs from that of the second pair.

Suppose there are two points $x$ and $y$ in $\mathbb{R}^n$ and you want to calculate the distance between them. Suppose that in the beginning only the first coordinate is revealed to you, so the observed distance is $d_1=\sqrt{(x_1-y_1)^2}$. After that, another coordinate is revealed, and the observed distance becomes $d_2=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2}$. Necessarily $d_2\geq d_1$ (with equality only when $x_2=y_2$), even though the two points $x$ and $y$ are the same in both cases. That means you have trouble comparing distances across different dimensions (but you can still meaningfully compare distances between different points when the dimension is fixed).

Taking, for example, the mean of the coordinate-by-coordinate distances, $d=\frac{1}{n}\sum_{i=1}^n |x_i-y_i|$, could be a remedy.
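
A small numerical sketch of this (Python/NumPy; the two points are just made-up random data, and "revealing" coordinates simply means using the first $k$ of them):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two fixed points in R^50; the coordinates are random made-up data, purely for illustration
n = 50
x, y = rng.normal(size=n), rng.normal(size=n)

for k in (1, 2, 5, 10, 50):
    d_euclid = np.sqrt(np.sum((x[:k] - y[:k]) ** 2))  # Euclidean distance on the first k coordinates
    d_mean = np.mean(np.abs(x[:k] - y[:k]))           # coordinate-averaged distance suggested above
    print(f"k={k:2d}  euclidean={d_euclid:6.3f}  mean_abs={d_mean:5.3f}")

# The Euclidean distance can only grow as more coordinates are revealed,
# whereas the coordinate-averaged distance stays on a comparable scale throughout.
```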

Richard Hardy
  • I know there already is an accepted answer, but I thought this could still give some insight. I am not working in this area, so I might not understand your real problem well, though. – Richard Hardy Mar 07 '16 at 20:40