5

http://en.wikipedia.org/wiki/Cluster_analysis

It states that: K-means separates data into Voronoi-cells, which assumes equal-sized clusters (not adequate here) and shows the image:

enter image description here

Question: What do they mean by size? Is it the size of the spread of the cluster or the amount of points in the cluster?

mlwida
  • 9,922
  • 2
  • 45
  • 74
Karl Morrison
  • 763
  • 2
  • 8
  • 17
  • Two other related questions - https://stats.stackexchange.com/questions/179297/assumption-of-equal-size-of-clusters-in-clustering - https://stats.stackexchange.com/questions/326685/is-it-true-that-k-means-has-an-assumption-each-cluster-has-a-roughly-equal-numb?noredirect=1&lq=1 – Karl Morrison Mar 12 '18 at 08:26

1 Answers1

4

The area of the original (occupied) data space, and to a lesser extend the area of the clusters convex hull, volume, spread. Not the cardinality.

Counterexample in 1 dimension, k=2: 1,2,3,4,5,6,7,8,9,10,100. Both kmeans clusters will cover a "similar" area of the data range (out of 1..100, values in 1..52.75 will be assigned to cluster 1, 52.75..100 will be cluster 2, but the cardinalities are 10 to 1). Note that the "spread" isn't always that similar: the last cluster has spread zero. But you could also use 1,1,1,1,1,1,1,1,1,10 as example. But this is just a rule of thumb, not a guarantee.

Remember that kmeans tries to minimize the SSQ. You usually do not improve SSQ by more evenly distributing the points, but the minimum SSQ will usually benefit from minimizing the SSQ of each part even if that means making the cardinalities different.

Has QUIT--Anony-Mousse
  • 39,639
  • 7
  • 61
  • 96