I am wondering: when clustering data using some general algorithm is there is an assumption on approximately equal sizes of the clusters? For example, in k-means as I know all clusters should have approx. equal number of samples. Does it also hold for other clustering algorithms?
-
3Really? that will be strange to know that all clusters should have approximately the same sample points. k-means only needs a distance metric and the number of means. Equality of cluster size is completely strange to me. – Chamberlain Mbah Oct 29 '15 at 22:56
-
1k-means, by itself does not do any checking on cluster sizes. It only cares about the mean of the current estimation. But I guess you can modify the algorithm and after converging to k-means, then divide the data into k partitions and assign each sample to the nearest mean, then you will get equal size clusters, but I doubt it will be a "better" clustering.. See Felipe's example on how equal sized clusters will fail.. – jeff Oct 30 '15 at 09:53
2 Answers
k-means does not care about cluster cardinalities
You are misunderstanding the common statement that k-means clusters "tend to be of the same size" (where size refers to the area, not cardinality). The latter is true to some extent, because k-means always splits the data on the middle orthogonal of the two clusters. This yields an approximately even division of the data space (at least if we ignore the infinite empty space outside your data - this is not mathematically rigorous).
However, if you have varying density in your data set (and why would you use clustering if you haven't) then two clusters of the same area do not have to have the same number of elements.
The only algorithm that I know which tries to ensure same cluster cardinality is this same-size-kmeans algorithm tutorial.

- 48,377
- 8
- 110
- 156

- 39,639
- 7
- 61
- 96
It does not hold, even in k means. Take for instance the following data:
...
...
xxxxxxxxxxxxxx
x x x x xxxxx
xxxxxxxxxx
xxxxxxxxxxx
If you run k means with 2 classes, clearly the two resulting clusters would have a different amount of elements.

- 622
- 3
- 7