9

I wonder what is the usefulness of k-means clustering in high dimensional spaces, and why it can be better (or not) than other clustering methods when dealing with high dimensional spaces.

cpumar
  • The answer can be found here: [How to tell if data is "clustered" enough for clustering algorithms to produce meaningful results](http://stats.stackexchange.com/a/35760/). Although that is only a particular answer to a more general question, this Q may be a duplicate. – gung - Reinstate Monica Apr 11 '14 at 18:01

1 Answer

16

Is k-means meaningful at all?

See for example my answer here: https://stats.stackexchange.com/a/35760/7828

k-means optimizes variances. Is the unweighted sum of variances meaningful for your data set? Probably not. How, then, can k-means be meaningful? In high-dimensional data, distance stops being discriminative: the nearest and the farthest neighbor tend to be almost equally far away. But variance is just squared Euclidean distance; so is it meaningful to optimize something that you know does not work well in high-dimensional data?
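To illustrate the concentration effect (a quick sketch, not part of the original answer): compare the relative contrast of Euclidean distances, (max − min)/min, for random points in low and high dimensions. As the dimension grows, the contrast shrinks toward zero, i.e. all distances look alike.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=500):
    """(max - min) / min of Euclidean distances from the origin
    for n standard-normal points in the given dimension."""
    points = rng.standard_normal((n, dim))
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.min()

low = relative_contrast(2)      # low-dimensional: large contrast
high = relative_contrast(1000)  # high-dimensional: contrast near zero
print(f"relative contrast in 2-D:    {low:.2f}")
print(f"relative contrast in 1000-D: {high:.2f}")
```

In 1000 dimensions the distances all concentrate around sqrt(1000), so the minimum and maximum are nearly equal; this is exactly the regime in which a sum-of-squared-distances objective loses discriminative power.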

For the particular problems of high-dimensional data, I recommend the following study:

Zimek, A., Schubert, E. and Kriegel, H.-P. (2012), A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5: 363–387. doi: 10.1002/sam.11161

Its main focus is outlier detection, but the observations on the challenges of high-dimensional data apply in a much broader context. The authors show with some simple experiments how high-dimensional data can become a problem. What I like about this study is that they also show high-dimensional data can be easy, too; it's not black and white, but you need to study your data carefully.

Useful is a different question. Often people use k-means not to actually discover clusters, but to find representative objects: it is a clever way of semi-randomly sampling k objects that are not too similar to each other.

If all you need is a clever way of sampling, k-means may be very useful.
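A minimal numpy sketch of that sampling use (my own illustration, with a plain Lloyd's-algorithm loop): run k-means and report, for each centroid, the nearest actual data point as the representative.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans_representatives(X, k, iters=20):
    """Lloyd's algorithm; returns the data point nearest each final centroid."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    # report actual data points (medoid-like representatives), not means
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return X[d.argmin(axis=0)]

X = rng.standard_normal((200, 5))
reps = kmeans_representatives(X, k=10)
print(reps.shape)  # (10, 5)
```

Because the k centroids repel each other toward different regions of the data, the returned objects are spread out rather than near-duplicates, which is exactly the "not too similar" sampling property described above.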

Has QUIT--Anony-Mousse
  • This answer might be really meaningful if you show `In high-dimensional data, distance doesn't work` - elaborate it, in the specific context of clustering. It is what the OP presumably wants to hear - demonstration or proof. For example, please do show that euclidean distance becomes less meaningful in 1D-2D-3D sequence. – ttnphns Apr 11 '14 at 19:12
  • Thank you for the suggestion. I've added a reference that discusses this in detail, and that I found very valuable. – Has QUIT--Anony-Mousse Apr 12 '14 at 18:43
  • @Anony-Mousse thanks for answering this. And thanks to ttnphns too for the suggestion :) – info_seeker Feb 20 '16 at 20:16