K-means with high dimensional data

Question

I read in many places that k-means clustering algorithm does not perform well when dealing with multidimensional binary data (so vectors whose entries are zero or one).

Intuitively, it is pretty easy to understand why: in a 1000 dimensional space, all the points have a similar distance, and k-means is a distance based method.

I am wondering if there is any study/paper that proves exactly this, or where there behavior of k-means in this setting is extensively studied.

Conceptually yes, but I am looking for a reference that can be cited in a paper. So something "official". — Ulderique Demoitre, Feb 04 '17 at 16:49
Well, you have a point there. Then again, you can try [citing CV or other SE threads](http://meta.stackexchange.com/q/49760/256777). I have done so in a paper. — Stephan Kolassa, Feb 04 '17 at 16:52
My reading of the answers in that apparent duplicate is that the situation is complex, contingent, and varied, which suggests a thoughtful analysis of your situation might be more useful than any citation could be. — whuber, Feb 04 '17 at 17:29
It's not as if k-means would work in low-dimensional binary data. Such data just does not cluster in the usual concept of "more dense regions". K-means requires continuous variables to make most sense - just as the mean. so it's not so much about the high dimensionality, but about applying the mean to non-continuous variables. — Has QUIT--Anony-Mousse, Feb 04 '17 at 18:43

score 1 · Answer 1 · edited Apr 13 '17 at 12:44

I think you should look at this answer before thinking about curse of dimensionality : k means with binary variables

Substantially the problem is which kind of k-means algorithm are you using?. If the euclidean distance is computed you're in the wrong direction: doesn't make any sense computing euclidean distance between binary or categorial variables!

Anyway maybe you used a special distance between observations and a special "measure of centrality" for centroids, but you didn't specified it.

K-means with high dimensional data

1 Answers1