
This is eye-opening, and the effect on KNN, for example, is easy to predict. But should the limitations of Euclidean distance in high dimensions be a reason for concern in the very common application of multiple regression?

UPDATE: After all the comments it is clear that it is not the number of examples (subjects, observations, or rows of the model matrix) that counts towards dimensionality: it is the number of features (regressors, independent variables, or columns of the model matrix). No matter how many observations there are, a model like $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 +\epsilon$ still describes a plane (2-dimensional).
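A quick simulation makes the asymmetry concrete (a minimal numpy sketch; the sample sizes and dimensions are arbitrary illustrative choices): the relative contrast between the largest and smallest pairwise Euclidean distance collapses as the number of *columns* grows, but not as the number of *rows* grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_features):
    """(max - min) / min over all pairwise Euclidean distances."""
    X = rng.standard_normal((n_points, n_features))
    diffs = X[:, None, :] - X[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)[np.triu_indices(n_points, k=1)]
    return (d.max() - d.min()) / d.min()

# Growing the number of FEATURES: distances concentrate (contrast -> 0)
for p in (2, 10, 100, 1000):
    print(f"p = {p:4d}: relative contrast = {relative_contrast(100, p):.3f}")

# Growing the number of OBSERVATIONS with p fixed: no concentration,
# the contrast stays large (it even grows with more extreme pairs)
for n in (100, 500, 1000):
    print(f"n = {n:4d}: relative contrast = {relative_contrast(n, 2):.3f}")
```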


Let me rephrase the question:

Why do high-dimensional vectors of observations (many elements, i.e. many rows), $y\in \mathbb R^{\text{huge}}$, not pose dimensionality issues, while many features (regressors, or columns) do? Even if the desired model is low-dimensional, computing it involves distances in high-dimensional spaces.
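Here is a hedged sketch of what I mean (simulated data, assumed coefficients, $\sigma = 1$): the residual norm $\lVert y-\hat y\rVert$ is indeed a Euclidean distance in $\mathbb R^n$ and grows like $\sqrt n$, yet what the fit effectively uses is the *average* squared distance, which settles down as $n$ grows instead of misbehaving.

```python
import numpy as np

rng = np.random.default_rng(1)
beta_true = np.array([1.0, 2.0, -0.5])  # intercept, beta_1, beta_2 (illustrative)

for n in (100, 10_000, 1_000_000):
    X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
    y = X @ beta_true + rng.standard_normal(n)       # noise with sigma = 1
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # same as (X'X)^{-1} X'y
    resid = y - X @ beta_hat
    # ||y - y_hat|| is a distance in R^n and grows like sqrt(n),
    # but the normalized version (MSE) converges to sigma^2 = 1:
    print(f"n = {n:8d}: ||resid|| = {np.linalg.norm(resid):9.1f}, "
          f"MSE = {np.mean(resid**2):.4f}, beta_hat = {np.round(beta_hat, 3)}")
```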

Antoni Parellada
  • Euclidean distance between what and what? // You may be interested in a question of mine from last year: https://stats.stackexchange.com/questions/492842/square-loss-for-big-data. – Dave Feb 11 '21 at 05:21
  • @Dave Have you made peace with the fact that your independent variable lives in $\mathbb R^n$, and the projection onto a hyperplane is what regression is about? The multiple observation of one dimension sounds at odds with the geometry. – Antoni Parellada Feb 11 '21 at 05:52
  • I disagree with Stephan in the comments, and I still wonder why MSE is appropriate for large data sets. Is your question the same as mine, Euclidean distance between predictions and observed values? – Dave Feb 11 '21 at 11:03
  • The concern is well-known and has a name: it goes under "multicollinearity." Except when data are collected according to designed experiments, observations in sufficiently high dimensions almost surely exhibit enough multicollinearity to be problematic. Search our site for threads on this topic, as well as threads on "variable selection," "model building," and "regularization." – whuber Feb 11 '21 at 19:24
  • @whuber The question that Dave brought up in the link he provides is really puzzling - which is true? Is a single independent variable with many observations to be considered as a single dimension, or is it multiple dimensions because it is a vector in $\mathbb R^n$? – Antoni Parellada Feb 11 '21 at 23:41
  • @whuber What you’re saying seems unrelated to the issue of $y,\hat{y}\in \mathbb{R}^{\text{huge}}$. Multicollinearity would apply to multiple predictor variables, but this matter of Euclidean distance between predictions and observations in a high dimension occurs if there are many predictors or just one predictor. What am I missing? – Dave Feb 11 '21 at 23:55
  • That appears to go against the linear transformation of $y$ giving the OLS estimate: $\hat{\beta}=(X^TX)^{-1}X^Ty$. To get that, it sure seems to me that we have to be able to think of $y\in\mathbb{R}^n$. I can see an argument that we use square loss because, under certain assumptions, we get a maximum likelihood estimate, but the goofiness of Euclidean distance in high dimensions still bothers me. – Dave Feb 12 '21 at 01:18
  • @Dave Why do you think it is against that linear transformation? The dimensions are consistent $(X^TX)^{−1}X^T$ is $\text{number of regressors}\times \text{huge}$ while $y$ is $\text{huge}\times 1$ resulting in a vector $\text{number of regressors}\times 1.$ – Antoni Parellada Feb 12 '21 at 01:33
  • Because $(X^TX)^{-1}X^T$ is a transformation from $n$-space (number of observations) to $p$-space (number of parameters). Unless you think of $y\in \mathbb{R}^n$, you can’t think of that as a linear transformation. – Dave Feb 12 '21 at 01:40
  • @Dave $y \text{ definitely }\in \mathbb R^n,$ but what I think Stephen was saying is that $y$ spans a $1$-dimensional sub-space in $ \mathbb R^n.$ – Antoni Parellada Feb 12 '21 at 03:07
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/119651/discussion-between-dave-and-numerically-illiterate). – Dave Feb 12 '21 at 04:19
  • @Dave The question of differences between a vector of values and a vector of predictions is purely *two* dimensional, according to Euclid, and so the dimensionality doesn't even arise as a concern. – whuber Feb 12 '21 at 14:04
  • @Dave I wonder if you would find [this](https://qr.ae/pNju9K) useful. – Antoni Parellada Feb 23 '21 at 14:08
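To make the dimension bookkeeping from the comments concrete, here is a small sketch (arbitrary simulated shapes, not anyone's definitive argument): $(X^TX)^{-1}X^T$ is $p \times n$, so it maps $y \in \mathbb R^n$ down to $\hat\beta \in \mathbb R^p$; $y$ and $\hat y$ together span at most a two-dimensional plane, which is whuber's point; and near-multicollinearity, whuber's other point, shows up once the number of columns approaches the number of rows.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 3                       # many observations, few regressors
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

A = np.linalg.inv(X.T @ X) @ X.T     # the map discussed in the comments
print(A.shape)                       # (3, 1000): from R^n down to R^p
beta_hat = A @ y                     # shape (3,): lives in R^p
y_hat = X @ beta_hat                 # shape (1000,): in R^n, inside col(X)

# y and y_hat together span (at most) a 2-dimensional plane, so the
# Euclidean distance ||y - y_hat|| is ordinary planar geometry:
print(np.linalg.matrix_rank(np.column_stack([y, y_hat])))  # 2

# Multicollinearity: even with independent random columns, X'X becomes
# ill-conditioned as the number of columns p approaches the number of rows n.
for p_big in (3, 500, 990):
    Xb = rng.standard_normal((n, p_big))
    print(p_big, f"cond(X'X) = {np.linalg.cond(Xb.T @ Xb):.2e}")
```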

0 Answers