1

I was going through a document of Western Michigan University to understand the limitations of K-means clustering algorithms. Below is the link:

https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf

On slide no 33 its mentioned that K-means has problems when clusters are of different

  • Sizes
  • Densities
  • Non globular shapes

Since we explore our data and try to figure out the different groups that are present in our data through the k-means clustering algorithm, how would we know that the size of the clusters is different beforehand? We can visualize it if we have two-dimensional data but how it can be done if the data is n-dimensional? Is there any way to examine the data before proceeding to apply k-means.

Also, the explanation for the limitation is: if we have different sizes of clusters, k-means will not give the desirable clusters as it tries to partition the clusters equally. But I don't think its always the case. I had applied k-means on the following dataset with k-means++ initialization

https://archive.ics.uci.edu/ml/datasets/online+retail

It gave me clusters with highly uneven distribution of 4346, 23, 3. I think I am missing some prerequisite steps before proceeding. Please help me clear my doubts. Thanks.

Igor F.
  • 6,004
  • 1
  • 16
  • 41

1 Answers1

1

You are right: you cannot know in advance that the assumptions hold. That's one reason why some people consider clustering an ill-posed problem.

Considering cluster sizes, you are also right. Uneven distribution is likely to be a problem when you have a cluster overlap. Then K-means will try to draw the boundary approximately half-way between the cluster centres. However, from the Bayesian standpoint, the boundary should be closer to the centre of the smaller cluster.

Nick Cox
  • 48,377
  • 8
  • 110
  • 156
Igor F.
  • 6,004
  • 1
  • 16
  • 41