19

I understand what the "curse of dimensionality" is, and I have worked on some high-dimensional optimization problems, so I know the challenge posed by the exponential number of possibilities.

However, I doubt that the "curse of dimensionality" exists in most real-world data (let's put images and videos aside for a moment; I am thinking of data such as customer demographics and purchase behavior).

We can collect data with thousands of features, but it is unlikely, perhaps even impossible, for those features to fully span a space with thousands of dimensions. This is why dimensionality reduction techniques are so popular.

In other words, it is very likely that the data do not contain an exponential amount of information: many features are highly correlated, and many follow an 80-20 rule (many instances share the same value).
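As a quick illustration of that intuition (my own sketch, not a claim about any particular dataset): generate a few hundred features that are all noisy mixtures of a handful of latent factors, and check how many principal components are needed to explain most of the variance. All the numbers here (200 features, 5 latent factors) are arbitrary.

# Minimal sketch: many correlated features, low effective dimensionality
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, p, k = 5000, 200, 5                        # samples, observed features, latent factors

latent = rng.normal(size=(n, k))              # a few underlying drivers
mixing = rng.normal(size=(k, p))              # each feature is a mixture of them
X = latent @ mixing + 0.1 * rng.normal(size=(n, p))    # plus a little noise

pca = PCA().fit(X)
explained = np.cumsum(pca.explained_variance_ratio_)
print("components needed for 95% of the variance:", np.searchsorted(explained, 0.95) + 1)
# With this setup roughly k components suffice, despite p = 200 columns.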

In such a case, I think methods like KNN will still work reasonably well. (Most books say the "curse of dimensionality" becomes problematic beyond roughly 10 dimensions, but their demos use uniform distributions in every dimension, where the entropy is really high. I doubt that ever happens in the real world.)

My personal experience with real data is that the "curse of dimensionality" does not affect template methods (such as KNN) too much, and in most cases ~100 dimensions still work.

Is this true for other people? (I have worked with real data in different industries for 5 years and have never observed that "all distance pairs have similar values", as described in the books.)
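For anyone who wants to check the "all distances look alike" effect on their own data, here is a minimal sketch (mine, not from any book): compare the spread of pairwise distances for i.i.d. uniform data against data with strong feature correlation, at the same nominal dimensionality. The sizes (500 points, 100 features, 3 latent factors) are arbitrary.

# Minimal sketch: distance concentration for uniform vs. correlated data
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, p = 500, 100

# i.i.d. uniform features: the textbook setting
X_uniform = rng.uniform(size=(n, p))

# strongly correlated features: noisy mixtures of 3 latent factors
latent = rng.normal(size=(n, 3))
X_corr = latent @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))

for name, X in [("uniform", X_uniform), ("correlated", X_corr)]:
    d = pdist(X)
    print(name, "relative spread of distances:", round((d.max() - d.min()) / d.mean(), 2))
# The relative spread is much smaller in the uniform case: distances concentrate.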

Haitao Du
  • Since you specifically excluded images and image analysis, I'll just put a plug in the comments by saying this field does deal with the curse of dimensionality quite regularly. It's very easy to get an overfit solution. – Ashe Jun 17 '16 at 14:14
  • Binary/dummy/one-hot encoded categorical features can easily blow up a distance-based model – shadowtalker Jun 17 '16 at 14:40
  • A colleague of mine has worked with sales of sunglasses. Quite a number of features were categorical with large numbers of possible levels (e.g., brands, materials of different parts of the glasses etc.). I was definitely afraid of the CoD, but it's always hard to say whether or not it is present in a particular dataset, plus we quite probably didn't do a lot of standard tricks, not being experts in this particular analysis type. – Stephan Kolassa Jun 17 '16 at 14:44
  • @StephanKolassa So, did distance-based methods work in the sunglasses use case? – Haitao Du Jun 17 '16 at 14:57
  • Not overly well. It was not a very successful project. – Stephan Kolassa Jun 17 '16 at 15:00
  • Depends a lot on your data source and goal... As a classically trained statistician, I have dealt with some weird examples. CoD often causes trivial results in efficiency analysis (search for "curse of dimensionality" + "data envelopment analysis"). Data from standard agricultural experiments lives in a world with very small samples and a very large number of mandatory features (if those features aren't taken into account, the experiment is considered a failure), and data collection often takes decades, so you can't wait for more data points. – Lucas Gallindo Jun 17 '16 at 20:33
  • "Exponential possibilities" isn't necessarily the main concern for "curse of dimensionality". It's rather that you are trying to make smooth model, but you space is incredibly sparse. kNN distance balls will extend to weird regions, which isn't helpful. That why I saw that kNN on high dimensions easily gives you results, but they just don't generalize. – Gerenuk Jun 20 '16 at 18:23

4 Answers

16

This paper (1) discusses the blessing of non-uniformity as a counterpoint to the curse of dimensionality. The main idea is that data are not uniformly dispersed within the feature space, so one can gain traction by identifying the ways in which the data are organized.

(1) Pedro Domingos, "A Few Useful Things to Know about Machine Learning"
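As a toy illustration of that blessing (my own sketch, not from the paper): when high-dimensional data are organized into a few well-separated clusters rather than spread uniformly, a simple nearest-neighbor rule has no trouble, because the relevant structure is effectively low-dimensional. Every number below (2000 samples, 100 features, 5 clusters) is arbitrary.

# Minimal sketch: clustered (non-uniform) data in 100 dimensions is easy for 1-NN
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=2000, n_features=100, centers=5,
                  cluster_std=3.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("1-NN accuracy on clustered 100-D data:", clf.score(X_te, y_te))
# Accuracy is high because the points occupy a few dense regions
# rather than the whole 100-dimensional cube.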

Sycorax
7

The curse of dimensionality in machine learning is more often a problem of empty space exploding between the few data points that you have. Data on a low-dimensional manifold can make it even worse. Here is an example setup with 10000 samples, in which I try kNN with 1 neighbor.

from numpy.random import normal
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score
import matplotlib.pyplot as plt
import numpy as np
from math import sqrt
from scipy.special import gamma

N = 10000      # number of samples
N_broad = 2    # dimensions of the "broad" 2D manifold
scale = 20     # shrink factor for the extra dimensions

dims = []
precs = []


def avg_distance(k):
    # expected norm of a k-dimensional standard normal vector:
    # sqrt(2) * Gamma((k+1)/2) / Gamma(k/2)
    return sqrt(2) * gamma((k + 1) / 2) / gamma(k / 2)


for dim in range(N_broad + 1, 30):
    clf = KNeighborsClassifier(n_neighbors=1, n_jobs=-1)

    # broad 2D plane plus (dim - N_broad) extra dimensions, rescaled so the
    # extra part has roughly unit expected norm before shrinking by `scale`
    X_train = np.hstack([normal(size=(N, N_broad)),
                         normal(size=(N, dim - N_broad)) / avg_distance(dim - N_broad) / scale])
    # the label depends only on the sign of the first "small" coordinate
    y_train = (X_train[:, N_broad] > 0).astype(int)
    clf.fit(X_train, y_train)

    X_test = np.hstack([normal(size=(N, N_broad)),
                        normal(size=(N, dim - N_broad)) / avg_distance(dim - N_broad) / scale])
    y_test = (X_test[:, N_broad] > 0).astype(int)
    y_test_pred = clf.predict(X_test)

    prec = precision_score(y_test, y_test_pred)
    dims.append(dim)
    precs.append(prec)
    print(dim, prec)

plt.plot(dims, precs)
plt.ylim([0.5, 1])
plt.xlabel("Dimension")
plt.ylabel("Precision")
plt.title("kNN(1) on {} samples".format(N))
plt.show()

Since you didn't like fully uniform distributions, I've made this a 2D manifold: the extra dimensions (shrunk by scale) are sprinkled around the 2D plane spanned by the first two coordinates. As it happens, one of the small dimensions is predictive (the label is 1 when that dimension is positive).

The precision drops rapidly with increasing dimension.

[Plot: kNN(1) precision vs. dimension]

Of course, precision = 0.5 would be random guessing. With a decision surface more complicated than a plane, it would get even worse.

It's as if the kNN balls are too sparse to be helpful for probing a smooth hyperplane. In higher dimensions they feel increasingly lonely.

On the other hand, methods like SVM have a global view and do much better.
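A minimal sketch of that last point (my addition, not part of the original answer): the same data-generating process as above, but classified with a linear SVM instead of kNN(1).

# Same setup as above, but with a global linear model
from math import sqrt
import numpy as np
from numpy.random import normal
from scipy.special import gamma
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_score

N = 10000      # samples
N_broad = 2    # broad "manifold" dimensions
scale = 20     # shrink factor for the extra dimensions

def avg_distance(k):
    # expected norm of a k-dimensional standard normal vector
    return sqrt(2) * gamma((k + 1) / 2) / gamma(k / 2)

def make_data(dim):
    X = np.hstack([normal(size=(N, N_broad)),
                   normal(size=(N, dim - N_broad)) / avg_distance(dim - N_broad) / scale])
    y = (X[:, N_broad] > 0).astype(int)
    return X, y

for dim in range(N_broad + 1, 30):
    X_train, y_train = make_data(dim)
    X_test, y_test = make_data(dim)
    svm = LinearSVC(max_iter=20000).fit(X_train, y_train)
    print(dim, precision_score(y_test, svm.predict(X_test)))
# Expect the precision to stay high across dimensions: the global hyperplane can
# still pick out the single informative (if tiny) coordinate.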

Gerenuk
5

Consider, for example, time series (and images, and audio). Sensor readings (the Internet of Things) are very common.

The curse of dimensionality is much more common than you think. There is a lot of redundancy there, but also a lot of noise.

The problem is that many people simply avoid these challenges of real data, and only use the same cherry-picked UCI data sets over and over again.
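To make the redundancy-plus-noise point concrete (my own sketch, not the answerer's): sliding windows over even a single noisy sensor channel already give a high-dimensional but highly redundant feature matrix, and the redundancy does not make the noise go away. The signal (a slow cycle plus noise) and all sizes below are made up.

# Minimal sketch: windowed sensor readings are high-dimensional but redundant
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = np.arange(20000)
reading = np.sin(2 * np.pi * t / 500) + 0.3 * rng.normal(size=t.size)  # slow cycle + noise

window = 200                                                  # 200 readings per feature vector
X = np.lib.stride_tricks.sliding_window_view(reading, window)[::50]   # one row every 50 steps

pca = PCA(n_components=5).fit(X)
print("feature matrix shape:", X.shape)
print("variance explained by the first 2 components:",
      round(float(pca.explained_variance_ratio_[:2].sum()), 3))
# Two components capture the periodic structure of the 200-dimensional windows;
# the remaining variance is mostly sensor noise spread thinly over all directions.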

Has QUIT--Anony-Mousse
  • Yes, I agree. Take a sequence model $P(\mathbf X)=P(X_1)\prod_{n=2}^N P(X_n \mid X_{n-1})$: the Markov property is a way to regularize and overcome the curse of dimensionality. But time series are kind of like images or video. – Haitao Du Jun 17 '16 at 20:09
  • Maybe most real world data *is* from sensors like images, video, and time series? – Has QUIT--Anony-Mousse Jun 17 '16 at 22:00
  • @hxd1011 The Markov property is an abstraction that may have nothing to do with real data! – Sycorax Jun 18 '16 at 03:12
0

There is a wonderful article, "Statistical Modeling: The Two Cultures", by Breiman. He explains the two groups of scientists who deal with data and how each of them looks at "dimensionality". The answer to your question is "it depends" on which group you are in. Check the paper out.

Zamir Akimbekov
  • Thanks @Zamir Akimbekov, there are great discussions [here](http://stats.stackexchange.com/questions/6/the-two-cultures-statistics-vs-machine-learning), and another interesting paper [here](https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf) – Haitao Du Jun 23 '16 at 13:44