4

The rule of thumb on choosing the best k for a k-means clustering suggests choosing $k$

$$ k \sim \sqrt{n/2} $$

$n$ being the number of points to cluster. I'd like to know where this comes from and what's the (heuristic) justification. I cannot find good sources around.

The only references I can find about this are a comment on reserchgate and this review, which does not explain it anyway.

mar tin
  • 167
  • 1
  • 9
  • I don't know if this particular rule is there but Gerald van Belle has a book titled Statistical Rules of Thumb published by Wiley which i believe is now in its second edition. – Michael R. Chernick May 01 '17 at 20:53
  • I have a copy of the first edition of van Belle's book (2002) and found no examples regarding k means clustering. – Michael R. Chernick May 01 '17 at 21:20
  • I don't remember where this comes from, found it on some old notes. There are some materials reporting it around, but found none with an explanation. – mar tin May 01 '17 at 21:24
  • 1
    van Belle's second edition came out in 2008 and adds a significant amount of material. You cam find in on the amazon website or the Wiley website. The table of contents is available on the amazon site. From it I could not find a section that dealt with k means clustering. You might want to check for yourself. – Michael R. Chernick May 01 '17 at 21:43
  • 2
    Related, but not a duplicate: [How to decide on the correct number of clusters?](https://stats.stackexchange.com/q/23472/1352) – Stephan Kolassa May 02 '17 at 04:02

3 Answers3

4

For the purpose of data approximation, this value can give you desired properties. Computing the pairwise distances of $\sqrt{n/2}$ takes approximately linear time, so if you want to reduce the size of your data set, this can be a value of interest.

For actual clustering, the value usually is unreasonably large.

Has QUIT--Anony-Mousse
  • 39,639
  • 7
  • 61
  • 96
1

I'm not sure if there is a "best" answer to this-I could only find a few references to your rule and no underlying theory. I went through some of the Springer texts (ISLR and ELSL) here on my laptop and the chapters mention K means reference there are ways to choose k-but there is no consensus on the matter.

There is just a single reference to additional material on the subject (Hastie et al. (2009)) in ISLR. It appears that this method might begin with assigning p values to your clusters, but the details are a bit thin and I have yet to open that part up... However that might be a place to start!

  • I went through the ELSL and found no reference to this specifically either. The point here is not about the best answer, I'd just like to understand the gist of where that estimation comes from. – mar tin May 01 '17 at 21:42
  • I'm sort of unfamiliar with this estimation myself! Do you recall where you have seen it? Also I made a typo in my reply. Sorry about that. I'll fix it and try to look up the reference I mentioned earlier – user7351362 May 01 '17 at 21:44
  • I've added to question! – mar tin May 01 '17 at 21:53
  • The answer is useful but as the OP suggests it does not answer the question as to the rationale for the rule that he/she gave. – Michael R. Chernick May 01 '17 at 22:01
1

The references ("good sources") for the "rule of thumb":

luart
  • 111
  • 2