How to define number of clusters in K-means clustering?

Question

Is there any way to determine the optimal cluster number or should I just try different values and check the error rates to decide on the best value?

@berkay How do you define an error rate for this unsupervised method? (or do you mean the within SS?) — chl, Mar 31 '11 at 19:17
@chl, i can use sum of squared errors for all clusters or overall accuracy (in this case i know the class labels.) — berkay, Mar 31 '11 at 20:10
@berkay A simple algorithm for finding the No. clusters is to compute the average WSS for 20 runs of k-means on an increasing number of clusters (starting with 2, and ending with say 9 or 10), and keep the solution that has minimal WSS over this clusters set. Another method is the [Gap statistic](http://gremlin1.gdcb.iastate.edu/MIP/gene/MicroarrayData/gapstatistics.pdf). But if you already have labelled instances, then why are you trying an unsupervised method? — chl, Mar 31 '11 at 20:21
@chl thanks, good question, we can guess the clusters depending features of the intances, i'm analyzing the new intrusion characteristics, mimicry of legal applications. — berkay, Mar 31 '11 at 20:30
@berkay Sorry but I'm not very familiar with that field. Do you mean the No. features will evolve with time? — chl, Mar 31 '11 at 21:01
See also the minimum spanning tree method under [visualization-software-for-clustering](http://stats.stackexchange.com/questions/1475/visualization-software-for-clustering) — denis, Jul 19 '11 at 15:10
The minimal WSS method is wrong. As the number of clusters increasing, the value of WSS will decrease. — , Mar 14 '13 at 14:13
I've answered a similar Q with half a dozen methods (using `R`) over here: http://stackoverflow.com/a/15376462/1036500 — Ben, Mar 18 '13 at 08:21

score 8 · Accepted Answer · edited Jul 05 '12 at 16:01

The method I use is to use CCC (Cubic Clustering Criteria). I look for CCC to increase to a maximum as I increment the number of clusters by 1, and then observe when the CCC starts to decrease. At that point I take the number of clusters at the (local) maximum. This would be similar to using a scree plot to picking the number of principal components.

SAS Technical Report A-108 Cubic Clustering Criterion (pdf)

$n$ = number of observations
$n_k$ = number in cluster $k$
$p$ = number of variables
$q$ = number of clusters
$X$ = $n\times p$ data matrix
$M$ = $q\times p$ matrix of cluster means
$Z$ = cluster indicator ($z_{ik}=1$ if obs. $i$ in cluster $k$, 0 otherwise)

Assume each variable has mean 0:
$Z’Z = \text{diag}(n_1, \cdots, n_q)$, $M = (Z’Z)-1Z’X$

$SS$(total) matrix = $T$= $X’X$
$SS$(between clusters) matrix = $B$ = $M’ Z’Z M$
$SS$(within clusters) matrix = $W$ = $T-B$

$R^2 = 1 – \frac{\text{trace(W)}}{\text{trace}(T)}$
(trace = sum of diagonal elements)

Stack columns of $X$ into one long column.
Regress on Kronecker product of $Z$ with $p\times p$ identity matrix
Compute $R^2$ for this regression – same $R^2$

The CCC idea is to compare the $R^2$ you get for a given set of clusters with the $R^2$ you would get by clustering a uniformly distributed set of points in $p$ dimensional space.

There are other criteria besides CCC. Have a look at [Determining the number of clusters in a data set](http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set), to see the main ones. — Vincent Labatut, Jun 04 '13 at 12:06

How to define number of clusters in K-means clustering?

1 Answers1

Linked