Questions tagged [high-dimensional]

Pertains to a large number of features or dimensions (variables) for data. (For a large number of data points, use the tag [large-data]; if the issue is a larger number of variables than data, use the [underdetermined] tag.)

334 questions
328 votes · 8 answers

Why is Euclidean distance not a good metric in high dimensions?

I read that 'Euclidean distance is not a good distance in high dimensions'. I guess this statement has something to do with the curse of dimensionality, but what exactly? Besides, what is 'high dimensions'? I have been applying hierarchical…
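The distance-concentration effect this question alludes to can be checked numerically. The sketch below is my own illustration (function name and parameters are not from the question): as the dimension grows, the ratio between the farthest and nearest neighbour distances shrinks toward 1, so Euclidean distance stops discriminating between points.

```python
import numpy as np

def distance_contrast(d, n=500, seed=0):
    """Ratio of farthest to nearest Euclidean distance from a random
    query point to n uniform points in the d-dimensional unit cube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.max() / dists.min()  # close to 1 means "all points look equidistant"

for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(d), 2))
```

In low dimensions the contrast ratio is large; by d = 1000 it collapses toward 1, which is one concrete reading of "Euclidean distance is not a good distance in high dimensions".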
103 votes · 11 answers

Explain "Curse of dimensionality" to a child

I have heard about the curse of dimensionality many times, but somehow I'm still unable to grasp the idea; it's all foggy. Can anyone explain this in the most intuitive way, as you would explain it to a child, so that I (and others as confused as I am)…
55 votes · 7 answers

Best PCA algorithm for huge number of features (>10K)?

I previously asked this on StackOverflow, but it seems like it might be more appropriate here, given that it didn't get any answers on SO. It's kind of at the intersection between statistics and programming. I need to write some code to do PCA…
dsimcha
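For a sense of what "PCA with many features" can look like in practice, here is a sketch of my own (synthetic stand-in data, not the asker's) using randomized SVD, which scikit-learn exposes via `PCA(svd_solver="randomized")`; it computes only the leading components without forming the full p-by-p covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

# Randomized SVD approximates only the top-k principal components,
# avoiding the O(p^2) covariance matrix that exact PCA would require.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2000))   # 500 samples, 2000 features (stand-in data)

pca = PCA(n_components=10, svd_solver="randomized", random_state=0)
Z = pca.fit_transform(X)
print(Z.shape)                          # 10-dimensional scores per sample
```

For truly huge feature counts, the same idea scales because the cost depends on the number of requested components rather than on p squared.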
37 votes · 3 answers

How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?

I want to use Lasso or ridge regression for a model with more than 50,000 variables, and I want to do so using a software package in R. How can I estimate the shrinkage parameter ($\lambda$)? Edit: here is the point I got up to: set.seed(123) Y <- runif…
John
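In R this is typically done with `glmnet::cv.glmnet`, which picks $\lambda$ by cross-validation. A comparable sketch in Python's scikit-learn (my own synthetic data and variable names, not the asker's setup; scikit-learn calls $\lambda$ `alpha`):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# LassoCV scans a grid of shrinkage values and keeps the one with the
# best mean cross-validated error.
rng = np.random.default_rng(123)
n, p = 200, 1000                        # p >> n, as in the question
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0                          # only five truly active variables
y = X @ beta + rng.standard_normal(n)

model = LassoCV(cv=5, random_state=0).fit(X, y)
print("chosen alpha:", model.alpha_)
print("nonzero coefficients:", np.count_nonzero(model.coef_))
```

With a strong sparse signal, the cross-validated fit recovers the five active variables with positive coefficients while zeroing out most of the rest.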
25 votes · 3 answers

Should dimensionality reduction for visualization be considered a "closed" problem, solved by t-SNE?

I've been reading a lot about the $t$-SNE algorithm for dimensionality reduction. I'm very impressed with its performance on "classic" datasets like MNIST, where it achieves a clear separation of the digits (see the original article). I've also used it to…
23 votes · 5 answers

Functional principal component analysis (FPCA): what is it all about?

Functional principal component analysis (FPCA) is something I have stumbled upon and never got to understand. What is it all about? See "A survey of functional principal component analysis" by Shang, 2011; I quote: PCA runs into serious…
23 votes · 1 answer

Should data be centered+scaled before applying t-SNE?

Some of my data's features have large values, while other features have much smaller values. Is it necessary to center+scale data before applying t-SNE to prevent bias towards the larger values? I use Python's sklearn.manifold.TSNE implementation…
stmax
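One common recipe (a sketch of my own, not from the question or its answers) is to standardize each feature before t-SNE so that large-scale features do not dominate the pairwise distances the algorithm works from:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales: without scaling, the first
# one would dominate every pairwise distance t-SNE computes.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1000, 100),   # large-scale feature
    rng.normal(0, 0.01, 100),   # tiny-scale feature
])

X_scaled = StandardScaler().fit_transform(X)  # mean 0, variance 1 per column
emb = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X_scaled)
print(emb.shape)
```

Whether standardizing is appropriate depends on whether the original scales carry meaning; when they are arbitrary units, scaling is usually the safer default.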
22 votes · 1 answer

Does the curse of dimensionality affect some models more than others?

The sources where I have been reading about the curse of dimensionality explain it primarily in connection with kNN, and with linear models in general. I regularly see top rankers on Kaggle using thousands of features on datasets that hardly have 100k data points. They…
20 votes · 1 answer

Why is LASSO not finding my perfect predictor pair at high dimensionality?

I'm running a small experiment with LASSO regression in R to test whether it is able to find a perfect predictor pair. The pair is defined like this: f1 + f2 = outcome. The outcome here is a predetermined vector called 'age'; f1 and f2 are created by…
Ansjovis86
19 votes · 5 answers

Why is a Gaussian distribution in high-dimensional space like a soap bubble?

In the famous post "Gaussian Distributions are Soap Bubbles" it is claimed that the distribution of the points looks like a soap bubble (less dense in the center and denser at the edge) instead of a blob of mold, where it is more…
Code Pope
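The "soap bubble" picture can be checked directly: the norm of a standard $d$-dimensional Gaussian sample concentrates tightly around $\sqrt{d}$, so almost all of the mass lives in a thin shell rather than near the origin. A quick numerical sketch of my own:

```python
import numpy as np

# Norms of standard Gaussian samples concentrate around sqrt(d): the
# mass sits in a thin shell (the "soap bubble"), not in a blob at 0.
rng = np.random.default_rng(0)
for d in (2, 100, 10_000):
    x = rng.standard_normal((1000, d))
    norms = np.linalg.norm(x, axis=1)
    # mean norm relative to sqrt(d), and relative spread of the norms
    print(d, norms.mean() / np.sqrt(d), norms.std() / norms.mean())
```

As d grows, the mean norm divided by sqrt(d) approaches 1 while the relative spread of the norms shrinks toward 0, which is exactly the thin-shell behavior the post describes.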
19 votes · 4 answers

Does "curse of dimensionality" really exist in real data?

I understand what the "curse of dimensionality" is, and I have done some high-dimensional optimization problems, so I know the challenge of exponentially many possibilities. However, I doubt whether the "curse of dimensionality" exists in most real-world data…
Haitao Du
17 votes · 1 answer

High-dimensional regression: why is $\log p/n$ special?

I am trying to read up on research in the area of high-dimensional regression, where $p$ is larger than $n$, that is, $p \gg n$. The quantity $\log p/n$ often appears in the rates of convergence for regression estimators. For…
Greenparker
17 votes · 3 answers

Curse of dimensionality: does cosine similarity work better, and if so, why?

When working with high-dimensional data, it is almost useless to compare data points using Euclidean distance; this is the curse of dimensionality. However, I have read that using different distance metrics, such as cosine similarity, performs…
PyRsquared
17 votes · 2 answers

How do I know my k-means clustering algorithm is suffering from the curse of dimensionality?

I believe that the title of this question says it all.
mathieu
15 votes · 4 answers

PCA on high-dimensional text data before random forest classification?

Does it make sense to do PCA before carrying out Random Forest classification? I'm dealing with high-dimensional text data, and I want to do feature reduction to help avoid the curse of dimensionality, but don't Random Forests already do some sort…