
In *The Elements of Statistical Learning*, a problem is introduced to highlight the issues with $k$-NN in high-dimensional spaces: there are $N$ data points uniformly distributed in a $p$-dimensional unit ball.

The median distance from the origin to the closest data point is given by the expression:

$$d(p,N) = \left(1-\left(\frac{1}{2}\right)^\frac{1}{N}\right)^\frac{1}{p}$$

When $N=1$, the formula reduces to $(1/2)^{1/p}$ (the median distance of a single point), and I can see how the closest point approaches the boundary as $p \rightarrow \infty$, thus making the intuition behind $k$-NN break down in high dimensions. But I can't grasp why the formula depends on $N$. Could someone please clarify?

The book also addresses this issue further, stating: "...prediction is much more difficult near the edges of the training sample. One must extrapolate from neighboring sample points rather than interpolate between them". This seems like a profound statement, but I can't grasp what it means. Could anyone rephrase it?

  • You need to edit your displayed equation a little. Is that $\frac 1N$ exponent applicable only to that $1$ in the numerator the way it looks now, or did you want it to apply to the whole $\frac 12$? – Dilip Sarwate Jan 02 '15 at 19:39
  • It would help to distinguish the "hypersphere" (which in $\mathbb{R}^p$ is a manifold of dimension $p-1$) from the "unit ball" (which has dimension $p$). The hypersphere is the *boundary* of the ball. If, as your title says, all points are sampled from the *hypersphere*, then by definition they all have distance $1$ from the origin, the median distance is $1$, and all are equally close to the origin. – whuber Jan 02 '15 at 21:06
  • @DilipSarwate It is applied to the whole $\frac{1}{2}$. In the book there is an example where $N=500, p=10$ so $d(p, N) \approx 0.52$ – user64773 Jan 02 '15 at 22:48

3 Answers


A $p$-dimensional ball of radius $r$ has volume proportional to $r^p$.

So the proportion of the volume lying more than a distance $kr$ from the origin (for $0 \le k \le 1$) is $\frac{r^p-(kr)^p}{r^p}=1-k^p$.

The probability that all $N$ randomly chosen points are more than a distance $kr$ from the origin is $\left(1-k^p\right)^N$. To get the median distance to the nearest random point, set this probability equal to $\frac12$. So $$\left(1-k^p\right)^N=\tfrac12 $$ $$\implies k=\left(1-\tfrac1{2^{1/N}}\right)^{1/p}.$$

Intuitively this makes some sort of sense: the more random points there are, the closer you expect the nearest one to the origin to be, so you should expect $k$ to be a decreasing function of $N$. Here $2^{1/N}$ is a decreasing function of $N$, so $\tfrac1{2^{1/N}}$ is an increasing function of $N$, and thus $1-\tfrac1{2^{1/N}}$ is a decreasing function of $N$ as is its $p$th root.
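A minimal Monte Carlo sketch to check this numerically (the sampler, function names, and trial counts here are illustrative, not from the original answer): uniform points in the unit ball are drawn as a normalized Gaussian direction scaled by a $U^{1/p}$ radius, a standard recipe.

```python
import numpy as np

def sample_unit_ball(n, p, rng):
    # Uniform points in the p-dimensional unit ball: a Gaussian
    # direction normalized to length 1, scaled by a U^(1/p) radius.
    g = rng.standard_normal((n, p))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return g * (rng.random(n) ** (1.0 / p))[:, None]

def simulated_median(N, p, trials=4000, seed=0):
    # Median, over many trials, of the distance from the origin
    # to the nearest of N uniform points.
    rng = np.random.default_rng(seed)
    return np.median([
        np.linalg.norm(sample_unit_ball(N, p, rng), axis=1).min()
        for _ in range(trials)
    ])

N, p = 500, 10                              # the book's example
print((1 - 0.5 ** (1 / N)) ** (1 / p))      # closed form: ~0.5178
print(simulated_median(N, p))               # should agree closely
```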

Henry
  • Ah, nice way of looking at it. Would you be able to reinterpret the quote in my second question? – user64773 Jan 02 '15 at 23:09
  • I suspect it may be suggesting that in high dimensions, points to predict are effectively a long way from the training data, as if on the edge of a sphere, so you are not really interpolating but rather extrapolating, and so uncertainties are much greater. But I do not really know. – Henry Jan 03 '15 at 01:03
  • I don't get it: I understand why this expression is the probability for all points to be farther than $kr$, but why does setting this probability to $1/2$ give the median distance? – ihadanny Oct 20 '15 at 21:38
  • @ihadanny: the value $k=\left(1-\tfrac1{2^{1/N}}\right)^{1/p}$ gives the fraction of the radius where the probability that all $N$ points are further away is $\frac12$, and so where the probability that at least one point is closer is $1-\frac12=\frac12$; hence $kr$ is the median of the distribution of the distance of the closest point. – Henry Oct 20 '15 at 23:22
  • Definition of median: half are bigger and half are smaller. – Grant Izmirlian Sep 20 '18 at 14:54

And now, without hand-waving:

  1. For any sequence of i.i.d. random variables $Y_1, \ldots, Y_N$, $$P\left( \min_{1\le i\le N} Y_i > y \right) = (1-F(y))^N,$$ where $F$ is the common CDF.

  2. Thus if we have $N$ i.i.d. points $X_i$ uniformly distributed in the unit ball in $p$ dimensions, then $$P\left( \min_{1\le i\le N} ||X_i|| > r \right) = (1-F(r))^N,$$ where $F$ is the common CDF of the distances $||X_i||$, $i=1,2,\ldots,N$. Finally, what is the CDF $F$ of the distance of a uniformly distributed point in the unit ball in $\mathbb{R}^p$? The probability that the point lies within the ball of radius $r$ inside the ball of unit radius equals the ratio of volumes:

$$F(r) = P ( ||X_i|| \le r ) = C r^p/( C 1^p) = r^p$$

Thus the solution to

$$1/2 = P( \min_{1\le i\le N} ||X_i|| > r ) = (1- r^p)^N$$

is

$$r = (1 - (1/2)^{1/N})^{1/p}.$$

As for your question about the dependence on the sample size $N$: for fixed $p$, as the ball fills up with more points, the minimum distance to the origin naturally becomes smaller.

Finally, there is something amiss in your ratio of volumes. It looks like $k$ should be the volume of the unit ball in $\mathbb{R}^p$.
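To see the ratio-of-volumes step numerically, here is a small illustrative sketch (parameters are arbitrary, not from the answer): rejection-sample uniform points in the ball from the enclosing cube and compare the empirical distance CDF with $r^p$. A low dimension keeps the acceptance rate workable.

```python
import numpy as np

# Rejection sampling: points uniform in the cube [-1, 1]^p that land
# inside the unit ball are uniform in the ball. With p = 3 the
# acceptance rate is about 52%.
rng = np.random.default_rng(1)
p, n = 3, 500_000
radii = np.linalg.norm(rng.uniform(-1, 1, size=(n, p)), axis=1)
radii = radii[radii <= 1.0]                 # keep points in the ball
for r in (0.3, 0.6, 0.9):
    print(r, np.mean(radii <= r), r ** p)   # empirical CDF vs r^p
```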

Grant Izmirlian

Just as concise, but in words:

We want to find the median distance from the origin to the closest of $N$ uniformly distributed points in the unit ball at the origin in $p$ dimensions. The probability that the smallest distance exceeds $r$ (call this quantity expression [1]) is the $N$th power of the probability that a single uniformly distributed point's distance exceeds $r$, by statistical independence. The latter is one minus the probability that a single uniformly distributed point lies within distance $r$, which in turn is the ratio of the volume of the ball of radius $r$ to that of the unit ball, namely $r^p$. We can now write expression [1] as

$$ P( \min_{1\le i\le N} ||X_i|| > r ) = (1- r^p)^N.$$

To find the median of the distribution of the minimum of the distances, set the above probability to $1/2$ and solve for $r$, obtaining the answer.
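Plugging numbers into the closed form shows both effects at once: the median distance climbs toward the boundary as $p$ grows and shrinks as $N$ grows. (An illustrative sketch; only $N=500$, $p=10$ comes from the book.)

```python
# Median distance d(p, N) of the nearest of N uniform points in the
# unit p-ball, from the closed-form expression derived above.
def d(p, N):
    return (1 - 0.5 ** (1 / N)) ** (1 / p)

for p in (1, 2, 10, 100, 1000):
    print(f"p={p:5d}  N=500: {d(p, 500):.4f}   N=5000: {d(p, 5000):.4f}")
```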

Grant Izmirlian