Automating determination of number of clusters from a kmeans cluster analysis

Question

I use kmeans for clustering a set of data. However, I have to specify the number of clusters. The problem is that sometimes I need 2 and other times I need 3 clusters.

Is there a clustering algorithm that determines the number of clusters by itself?

You may find this related question on number of clusters useful: http://stats.stackexchange.com/questions/2597/what-stop-criteria-for-agglomerative-hierarchical-clustering-are-used-in-practice – Jeromy Anglim, May 02 '11 at 08:14
Just to clarify: are you looking for a feature that will automate the determination of the number of clusters, or are you simply looking to batch process a set of cluster analyses where the number of clusters is known? What is your current manual approach to deciding whether two or three clusters is appropriate? Do you wish to keep using this rule, or are you interested in other procedures for determining the number of clusters? – Jeromy Anglim, May 02 '11 at 08:19
You can have a look at the "[clues](http://www.jstatsoft.org/v33/i04/paper)" method. – Beta, May 02 '11 at 23:11

score 3 · Answer 1 · answered Nov 15 '12 at 00:32


This is a great paper to start with:

Tibshirani, Walther and Hastie (2001), "Estimating the number of clusters in a data set via the gap statistic"

It's really easy to implement something similar in any language.
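As a rough illustration of the gap statistic idea, here is a minimal Python sketch. It assumes scikit-learn's `KMeans` as the clustering routine and a uniform reference distribution over the bounding box of the data; the function names `log_dispersion` and `gap_statistic_k` and the defaults (`k_max`, `n_refs`, `seed`) are made up for this example and are not from the paper or the answer.

```python
import numpy as np
from sklearn.cluster import KMeans


def log_dispersion(X, k, seed):
    """Log of the within-cluster sum of squares W_k after running k-means."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(km.inertia_)


def gap_statistic_k(X, k_max=6, n_refs=10, seed=0):
    """Pick k with the rule: smallest k such that Gap(k) >= Gap(k+1) - s(k+1).

    Hypothetical helper: reference data sets are drawn uniformly from the
    bounding box of X (one of the reference choices discussed in the paper).
    """
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        # log(W_k) on n_refs reference data sets with no cluster structure
        ref = np.array([
            log_dispersion(rng.uniform(lo, hi, size=X.shape), k, seed)
            for _ in range(n_refs)
        ])
        gaps.append(ref.mean() - log_dispersion(X, k, seed))
        # standard deviation of the reference values, inflated for simulation error
        s.append(ref.std() * np.sqrt(1.0 + 1.0 / n_refs))
    for k in range(1, k_max):
        # gaps[k] and s[k] hold Gap(k+1) and s(k+1) because lists are 0-indexed
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k
    return k_max
```

Calling `gap_statistic_k(X)` on an `(n_samples, n_features)` array returns the suggested number of clusters. Unlike most criteria it can also return 1, i.e. "no cluster structure".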


msemelman


score 2 · Answer 2 · answered May 02 '11 at 08:00


Simplest solution: run both and then check which gives the best results...


Nick Sabbe


What do you define as the best result? How do you suggest this should be automated as per user2721's question? – Jeromy Anglim May 02 '11 at 08:13

He indicates that sometimes he 'needs' 2 and sometimes 3. So apparently, he has some criterion, which he doesn't mention. It surely is not uncommon to run clustering over a set of possible cluster numbers, and then evaluate. – Nick Sabbe May 02 '11 at 08:30

@Jeromy The [Silhouette measure](http://en.wikipedia.org/wiki/Silhouette_(clustering)) is quite usable, at least as a start; it often shows a clear maximum, so it is well suited to automatic optimization. – May 02 '11 at 08:41
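
The "run over a set of possible cluster numbers, then evaluate" approach from this thread could look roughly like the sketch below. It assumes scikit-learn's `KMeans` and `silhouette_score`; the helper name `best_k_by_silhouette` and its defaults are invented for the example. It keeps the k with the highest mean silhouette, which is only defined for k >= 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, candidates=(2, 3), seed=0):
    """Fit k-means for each candidate k and return (best_k, scores).

    Hypothetical helper: 'best' means the highest mean silhouette value.
    """
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

For the question's case, `best_k_by_silhouette(X, candidates=(2, 3))` chooses between 2 and 3 clusters; the returned `scores` dictionary shows how close the two options were, which is worth checking before trusting the automatic choice.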
