Limitations of K-Means Clustering

Question

I was going through a document of Western Michigan University to understand the limitations of K-means clustering algorithms. Below is the link:

https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf

On slide no 33 its mentioned that K-means has problems when clusters are of different

Sizes
Densities
Non globular shapes

Since we explore our data and try to figure out the different groups that are present in our data through the k-means clustering algorithm, how would we know that the size of the clusters is different beforehand? We can visualize it if we have two-dimensional data but how it can be done if the data is n-dimensional? Is there any way to examine the data before proceeding to apply k-means.

Also, the explanation for the limitation is: if we have different sizes of clusters, k-means will not give the desirable clusters as it tries to partition the clusters equally. But I don't think its always the case. I had applied k-means on the following dataset with k-means++ initialization

https://archive.ics.uci.edu/ml/datasets/online+retail

It gave me clusters with highly uneven distribution of 4346, 23, 3. I think I am missing some prerequisite steps before proceeding. Please help me clear my doubts. Thanks.

score 1 · Answer 1 · edited Jul 08 '20 at 14:23

You are right: you cannot know in advance that the assumptions hold. That's one reason why some people consider clustering an ill-posed problem.

Considering cluster sizes, you are also right. Uneven distribution is likely to be a problem when you have a cluster overlap. Then K-means will try to draw the boundary approximately half-way between the cluster centres. However, from the Bayesian standpoint, the boundary should be closer to the centre of the smaller cluster.

Limitations of K-Means Clustering

1 Answers1