9

I wonder what is the usefulness of k-means clustering in high dimensional spaces, and why it can be better (or not) than other clustering methods when dealing with high dimensional spaces.

cpumar
  • The answer can be found here: [How to tell if data is "clustered" enough for clustering algorithms to produce meaningful results](http://stats.stackexchange.com/a/35760/). Although that is only a particular answer to a more general question, this Q may be a duplicate. – gung - Reinstate Monica Apr 11 '14 at 18:01

1 Answer

16

Is k-means meaningful at all?

See for example my answer here: https://stats.stackexchange.com/a/35760/7828

k-means optimizes variances. Is the unweighted sum of variances meaningful for your data set? Probably not. How, then, can k-means be meaningful? In high-dimensional data, distance stops being discriminative: the nearest and the farthest neighbor tend to be almost equally far away. But variance is just squared Euclidean distance; so is it meaningful to optimize something that you know does not work well in high-dimensional data?
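To illustrate the concentration effect (a quick sketch, not part of the original answer): compare the relative contrast of Euclidean distances, (max − min)/min, for random points in low and high dimensions. As the dimension grows, the contrast shrinks toward zero, i.e. all distances look alike.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=500):
    """(max - min) / min of Euclidean distances from the origin
    for n standard-normal points in the given dimension."""
    points = rng.standard_normal((n, dim))
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.min()

low = relative_contrast(2)      # low-dimensional: large contrast
high = relative_contrast(1000)  # high-dimensional: contrast near zero
print(f"relative contrast in 2-D:    {low:.2f}")
print(f"relative contrast in 1000-D: {high:.2f}")
```

In 1000 dimensions the distances all concentrate around sqrt(1000), so the minimum and maximum are nearly equal; this is exactly the regime in which a sum-of-squared-distances objective loses discriminative power.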

For the particular problems of high-dimensional data, I recommend the following study:

Zimek, A., Schubert, E. and Kriegel, H.-P. (2012), A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5: 363–387. doi: 10.1002/sam.11161

Its main focus is outlier detection, but the observations on the challenges of high-dimensional data apply in a much broader context. The authors show with some simple experiments how high-dimensional data can become a problem. What I like about this study is that they also show high-dimensional data can be easy, too; it's not black and white, but you need to study your data carefully.

Useful is a different question. Often people use k-means not to actually discover clusters, but to find representative objects: it is a clever way of semi-randomly sampling k objects that are not too similar to each other.

If all you need is a clever way of sampling, k-means may be very useful.
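A minimal numpy sketch of that sampling use (my own illustration, with a plain Lloyd's-algorithm loop): run k-means and report, for each centroid, the nearest actual data point as the representative.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans_representatives(X, k, iters=20):
    """Lloyd's algorithm; returns the data point nearest each final centroid."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    # report actual data points (medoid-like representatives), not means
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return X[d.argmin(axis=0)]

X = rng.standard_normal((200, 5))
reps = kmeans_representatives(X, k=10)
print(reps.shape)  # (10, 5)
```

Because the k centroids repel each other toward different regions of the data, the returned objects are spread out rather than near-duplicates, which is exactly the "not too similar" sampling property described above.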

Has QUIT--Anony-Mousse
  • This answer might be really meaningful if you show `In high-dimensional data, distance doesn't work` - elaborate it, in the specific context of clustering. It is what the OP presumably wants to hear - demonstration or proof. For example, please do show that euclidean distance becomes less meaningful in 1D-2D-3D sequence. – ttnphns Apr 11 '14 at 19:12
  • Thank you for the suggestion. I've added a reference that discusses this in detail, and that I found very valuable. – Has QUIT--Anony-Mousse Apr 12 '14 at 18:43
  • @Anony-Mousse thanks for answering this. And thanks to ttnphns too for the suggestion :) – info_seeker Feb 20 '16 at 20:16