What are probabilistic approaches to finding the right number of clusters?

Question

As per answers to this question, there are shortcomings in the heuristics of deciding on the number of clusters.

A more robust approach could be probability-based clustering: from a probabilistic perspective, the goal of clustering is to find the most likely set of clusters given the data. Thus, we can never be "100% sure" that a training instance should be placed into some cluster: it just has a certain probability of belonging to it.
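
To make "a certain probability of belonging" concrete, here is a minimal sketch of soft assignment under a one-dimensional Gaussian mixture. The mixture parameters (`weights`, `means`, `stds`) are made-up illustrative values, not estimates from any data set, and the function names are my own.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def responsibilities(x, weights, means, stds):
    """Posterior probability of each component given x (Bayes' rule)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-component mixture, chosen only for illustration.
weights = [0.5, 0.5]
means = [0.0, 4.0]
stds = [1.0, 1.0]

for x in (0.0, 2.0, 3.5):
    print(x, [round(r, 3) for r in responsibilities(x, weights, means, stds)])
```

A point halfway between the two means gets equal membership probabilities, while a point near one mean is assigned to that component almost, but never exactly, with certainty.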

I wonder whether this reasoning is correct and how it would work in practice.

@andreister wasn't my answer to that question also the answer to this question? — tdc, Feb 17 '12 at 11:12

score 8 · Accepted Answer · answered Feb 17 '12 at 10:15

There are methods to do that. A good starting point is

Rasmussen, C. E. (2000). The Infinite Gaussian Mixture Model. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12 (Vol. 12, pp. 554-560). MIT Press.

The idea is to put a Dirichlet prior on the mixture weights of a mixture of Gaussians and take the limit of infinitely many components. Since you always have finitely many data points, it doesn't matter that the model potentially has infinitely many mixture components; only finitely many of them are ever occupied, and the model can open a new cluster whenever the data call for one.
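
To illustrate why the number of clusters does not have to be fixed in advance, here is a minimal sketch of the prior over cluster assignments that this limit induces, usually called the Chinese restaurant process. It only samples from the prior; it is not the inference procedure from the paper, and it ignores the Gaussian likelihood entirely. The function name `sample_crp` and the concentration parameter `alpha` are my own naming.

```python
import random

def sample_crp(n_points, alpha, rng):
    """Sample cluster assignments from a Chinese restaurant process prior.

    Point i (0-based) joins existing cluster k with probability
    counts[k] / (i + alpha) and opens a new cluster with probability
    alpha / (i + alpha).
    """
    counts = []       # counts[k] = number of points currently in cluster k
    assignments = []
    for _ in range(n_points):
        # Unnormalised weights: one per existing cluster, plus one for "new".
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts

rng = random.Random(0)
for n in (10, 100, 1000):
    _, counts = sample_crp(n, alpha=1.0, rng=rng)
    print(n, "points ->", len(counts), "occupied clusters")
```

The number of occupied clusters is random and tends to grow slowly with the number of points; a larger `alpha` favours more clusters. In the full model, each step would additionally weight the choices by how well the point fits each cluster's Gaussian.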

There is a lot more work on that. A good starting point would be the publications of Yee Whye Teh.

score 0 · Answer 2 · answered Feb 23 '12 at 08:39

The first question you should then answer is:

What is a cluster?

Most of the time, a cluster is whatever the clustering algorithm finds, which is then correct by definition.

If you run e.g. k-means, it does a good job of finding a (locally) optimal $k$-cell Voronoi partitioning of the dataset. So if you are referring to k-means, the question is: what are the chances that the data set is based on $k$ Voronoi cells?
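
To see why k-means amounts to a Voronoi partitioning, here is a minimal sketch of Lloyd's algorithm, the usual k-means heuristic. Every point is assigned to its nearest centroid, which is exactly the Voronoi cell it falls in. The toy points and initial centroids are made up for illustration, and the result is in general only a local optimum that depends on the initialisation.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point, i.e. its Voronoi cell."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def lloyd(points, centroids, n_iter=10):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute means."""
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                # New centroid is the coordinate-wise mean of the cell's members.
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cell's centroid unchanged
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
centroids, labels = lloyd(points, [(0.0, 0.0), (6.0, 5.0)])
print(labels)
print(centroids)
```

Note that the assignment is hard: a point belongs to exactly one cell, with no probability attached, which is the contrast with the probabilistic view in the question.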
