2

I read in many places that k-means clustering algorithm does not perform well when dealing with multidimensional binary data (so vectors whose entries are zero or one).

Intuitively, it is pretty easy to understand why: in a 1000 dimensional space, all the points have a similar distance, and k-means is a distance based method.

I am wondering if there is any study/paper that proves exactly this, or where there behavior of k-means in this setting is extensively studied.

  • Conceptually yes, but I am looking for a reference that can be cited in a paper. So something "official". – Ulderique Demoitre Feb 04 '17 at 16:49
  • 1
    Well, you have a point there. Then again, you can try [citing CV or other SE threads](http://meta.stackexchange.com/q/49760/256777). I have done so in a paper. – Stephan Kolassa Feb 04 '17 at 16:52
  • My reading of the answers in that apparent duplicate is that the situation is complex, contingent, and varied, which suggests a thoughtful analysis of your situation might be more useful than any citation could be. – whuber Feb 04 '17 at 17:29
  • It's not as if k-means would work in low-dimensional binary data. Such data just does not cluster in the usual concept of "more dense regions". K-means requires continuous variables to make most sense - just as the mean. so it's not so much about the high dimensionality, but about applying the mean to non-continuous variables. – Has QUIT--Anony-Mousse Feb 04 '17 at 18:43

1 Answers1

1

I think you should look at this answer before thinking about curse of dimensionality : k means with binary variables

Substantially the problem is which kind of k-means algorithm are you using?. If the euclidean distance is computed you're in the wrong direction: doesn't make any sense computing euclidean distance between binary or categorial variables!

Anyway maybe you used a special distance between observations and a special "measure of centrality" for centroids, but you didn't specified it.

jmarkov
  • 1
  • 2