Rule of thumb on the best k in k-means clustering

Question

The rule of thumb on choosing the best k for a k-means clustering suggests choosing $k$

$$ k \sim \sqrt{n/2} $$

$n$ being the number of points to cluster. I'd like to know where this comes from and what's the (heuristic) justification. I cannot find good sources around.

The only references I can find about this are a comment on reserchgate and this review, which does not explain it anyway.

I don't know if this particular rule is there but Gerald van Belle has a book titled Statistical Rules of Thumb published by Wiley which i believe is now in its second edition. — Michael R. Chernick, May 01 '17 at 20:53
I have a copy of the first edition of van Belle's book (2002) and found no examples regarding k means clustering. — Michael R. Chernick, May 01 '17 at 21:20
I don't remember where this comes from, found it on some old notes. There are some materials reporting it around, but found none with an explanation. — mar tin, May 01 '17 at 21:24
van Belle's second edition came out in 2008 and adds a significant amount of material. You cam find in on the amazon website or the Wiley website. The table of contents is available on the amazon site. From it I could not find a section that dealt with k means clustering. You might want to check for yourself. — Michael R. Chernick, May 01 '17 at 21:43
Related, but not a duplicate: [How to decide on the correct number of clusters?](https://stats.stackexchange.com/q/23472/1352) — Stephan Kolassa, May 02 '17 at 04:02

score 4 · Accepted Answer · answered May 02 '17 at 05:55

For the purpose of data approximation, this value can give you desired properties. Computing the pairwise distances of $\sqrt{n/2}$ takes approximately linear time, so if you want to reduce the size of your data set, this can be a value of interest.

For actual clustering, the value usually is unreasonably large.

user7351362 · Answer 2 · 2017-05-01T21:47:59.840

1

I'm not sure if there is a "best" answer to this-I could only find a few references to your rule and no underlying theory. I went through some of the Springer texts (ISLR and ELSL) here on my laptop and the chapters mention K means reference there are ways to choose k-but there is no consensus on the matter.

There is just a single reference to additional material on the subject (Hastie et al. (2009)) in ISLR. It appears that this method might begin with assigning p values to your clusters, but the details are a bit thin and I have yet to open that part up... However that might be a place to start!

edited May 01 '17 at 21:47

answered May 01 '17 at 21:39

user7351362

21
5

I went through the ELSL and found no reference to this specifically either. The point here is not about the best answer, I'd just like to understand the gist of where that estimation comes from. – mar tin May 01 '17 at 21:42
I'm sort of unfamiliar with this estimation myself! Do you recall where you have seen it? Also I made a typo in my reply. Sorry about that. I'll fix it and try to look up the reference I mentioned earlier – user7351362 May 01 '17 at 21:44
I've added to question! – mar tin May 01 '17 at 21:53
The answer is useful but as the OP suggests it does not answer the question as to the rationale for the rule that he/she gave. – Michael R. Chernick May 01 '17 at 22:01

score 1 · Answer 3 · answered Sep 28 '19 at 15:49

The references ("good sources") for the "rule of thumb":

Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979) Multivariate Analysis. Academic Press, London. ISBN 0‐12‐471252‐5. See Multivariate Analysis, p.365
Cluster Validity in Clustering Methods (PhD dissertation)
Clustering Approach in Wireless Sensor Networks Based on K-means: limitations and Recommendations
Ahmed, M. & Mahmood, A.N. Ann. Data. Sci. (2015) 2: 111. https://doi.org/10.1007/s40745-015-0035-y

Rule of thumb on the best k in k-means clustering

3 Answers3