
I just want to know why the curse of dimensionality affects non-parametric approaches so much more than parametric ones.

Karolis Koncevičius
olad uhg
  • Reference, please – Dave Oct 19 '21 at 00:06
  • @Dave from the book An Introduction to Statistical Learning with Applications in R, I quote "When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that nonparametric approaches often perform poorly when p is large" – olad uhg Oct 19 '21 at 00:32
  • Could you please edit that (useful!) information into the original question? – Dave Oct 19 '21 at 00:37
  • The problem in this case is the locality. The curse of dimensionality makes the locality infinite in size. All observations become equally far away from any point because the universe collapses into a point. – Aksakal Oct 19 '21 at 02:37
  • Here's an intuitive explanation: A parametric method has a constrained solution set. So adding new dimensions increases the number of solutions much less than for non-parametric methods, which in turn makes them less prone to the curse of dimensionality. – drmaettu Oct 19 '21 at 07:47
  • Rather than _nonparametric_ it would be slightly better to refer to constrained vs. unconstrained models. There are parametric methods that suffer the curse of dimensionality, e.g., a regression model with all possible two-way interactions. – Frank Harrell Oct 19 '21 at 12:34

1 Answer


I like the intuitive explanation given by @drmaettu in the comments and want to offer another heuristic, intuitive answer. Suppose that $X_1, X_2, \dots, X_n$ forms an independent and identically distributed sample from some $d$-variate probability distribution with unknown density $f$. The objective is to estimate $f$ based on the observed sample.

If we just assume that $f$ is, say, twice continuously differentiable, all information about the shape, curvature, etc. of the function has to be deduced from the sample. A good estimate thus requires the space to be adequately populated as we know essentially nothing about the function. However, the number of required data points to adequately populate a space increases exponentially with the number of dimensions.¹
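
To make this tangible, here is a minimal numerical sketch (the sample size, radius, and list of dimensions are arbitrary choices): sample uniform points in $[0,1]^d$ and look at how many of them lie within a fixed distance of the centre, and at how similar the nearest and farthest distances from the centre become. Fixed-radius neighbourhoods empty out quickly, and, as noted in the comments, the observations become almost equally far away from the query point.

```python
# Sketch: how "local" neighbourhoods behave as the dimension grows.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # arbitrary sample size

for d in (1, 2, 5, 10, 20, 50):
    x = rng.uniform(size=(n, d))              # i.i.d. uniform sample in [0,1]^d
    dist = np.linalg.norm(x - 0.5, axis=1)    # Euclidean distances to the centre
    frac = np.mean(dist <= 0.5)               # share of points within radius 0.5
    ratio = dist.max() / dist.min()           # close to 1 means all points are roughly equally far away
    print(f"d={d:3d}  fraction within 0.5 of centre: {frac:.4f}  max/min distance: {ratio:.2f}")
```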

On the other hand, if we assume that $f(x) = (2\pi)^{-\frac{d}{2}}\exp\left(-\frac 12\Vert x-\theta\Vert_2^2\right)$, where $\theta\in\mathbb R^d$ is unknown, then we effectively know what the function looks like at every point $x\in\mathbb R^d$ with regard to its shape, curvature, etc. That is, much less information has to be deduced from the sample.
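
A minimal sketch of this plug-in idea (the dimension, sample size, and random seed below are arbitrary): the sample mean alone determines an estimate of the whole density, which can then be evaluated at any point, whether or not observations happen to fall nearby.

```python
# Sketch: in the Gaussian location model, estimating d numbers pins down the whole density.
import numpy as np

rng = np.random.default_rng(1)
d, n = 10, 500
theta = rng.normal(size=d)                   # the unknown location (simulated here)
x = rng.normal(loc=theta, size=(n, d))       # i.i.d. sample from N(theta, I_d)

theta_hat = x.mean(axis=0)                   # only d numbers are learned from the data

def f_hat(point):
    """Plug-in estimate of the density at an arbitrary point of R^d."""
    return (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum((point - theta_hat) ** 2))

print("||theta_hat - theta|| =", np.linalg.norm(theta_hat - theta))
print("estimated density at theta:", f_hat(theta))
print("true density at theta:     ", (2 * np.pi) ** (-d / 2))
```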

Consequently, the optimal rate of convergence of nonparametric estimators depends on $d$, while the optimal rate of convergence of parametric estimators is $\mathcal O(n^{-\frac12})$, independent of $d$.
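
To see what this means for the required sample size, suppose we want an estimation error of at most $\varepsilon$; a back-of-the-envelope comparison (assuming the standard nonparametric rate $n^{-\frac{2}{4+d}}$ for twice continuously differentiable densities) gives

$$n^{-\frac12}\le\varepsilon \;\Longleftrightarrow\; n\ge\varepsilon^{-2} \quad\text{(parametric, independent of } d\text{)},$$

$$n^{-\frac{2}{4+d}}\le\varepsilon \;\Longleftrightarrow\; n\ge\varepsilon^{-\frac{4+d}{2}}=\varepsilon^{-2}\left(\tfrac{1}{\varepsilon}\right)^{\frac{d}{2}} \quad\text{(nonparametric, exponential in } d\text{)}.$$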


¹ Consider, for instance, the unit interval $[0,1]$. Suppose we fix a grid of 100 evenly spaced points in $[0,1]$, so that two adjacent grid points are roughly 0.01 apart. To ensure the same maximum distance of 0.01 between two adjacent points in the unit square $[0,1]^2$ would require a grid with 100² evenly spaced points. In the unit cube $[0,1]^3$, a grid with 100³ evenly spaced points would be necessary to ensure a maximum distance of 0.01 between two adjacent points.
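
A one-liner check of these counts (the list of dimensions is an arbitrary choice):

```python
# Grid points needed in [0,1]^d with 100 points per axis, i.e. 100**d in total.
points_per_axis = 100

for d in (1, 2, 3, 5, 10):
    print(f"d={d:2d}: {points_per_axis ** d:,} grid points")
```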

lmaosome