20

Is there any way to determine the optimal cluster number or should I just try different values and check the error rates to decide on the best value?

Andy W
  • 15,245
  • 8
  • 69
  • 191
berkay
  • 530
  • 1
  • 5
  • 17
  • 1
    @berkay How do you define an error rate for this unsupervised method? (or do you mean the within SS?) – chl Mar 31 '11 at 19:17
  • @chl, i can use sum of squared errors for all clusters or overall accuracy (in this case i know the class labels.) – berkay Mar 31 '11 at 20:10
  • 3
    @berkay A simple algorithm for finding the No. clusters is to compute the average WSS for 20 runs of k-means on an increasing number of clusters (starting with 2, and ending with say 9 or 10), and keep the solution that has minimal WSS over this clusters set. Another method is the [Gap statistic](http://gremlin1.gdcb.iastate.edu/MIP/gene/MicroarrayData/gapstatistics.pdf). But if you already have labelled instances, then why are you trying an unsupervised method? – chl Mar 31 '11 at 20:21
  • @chl thanks, good question, we can guess the clusters depending features of the intances, i'm analyzing the new intrusion characteristics, mimicry of legal applications. – berkay Mar 31 '11 at 20:30
  • @berkay Sorry but I'm not very familiar with that field. Do you mean the No. features will evolve with time? – chl Mar 31 '11 at 21:01
  • @chl, exactly.. – berkay Mar 31 '11 at 21:20
  • See also the minimum spanning tree method under [visualization-software-for-clustering](http://stats.stackexchange.com/questions/1475/visualization-software-for-clustering) – denis Jul 19 '11 at 15:10
  • The minimal WSS method is wrong. As the number of clusters increasing, the value of WSS will decrease. –  Mar 14 '13 at 14:13
  • 2
    I've answered a similar Q with half a dozen methods (using `R`) over here: http://stackoverflow.com/a/15376462/1036500 – Ben Mar 18 '13 at 08:21

1 Answers1

8

The method I use is to use CCC (Cubic Clustering Criteria). I look for CCC to increase to a maximum as I increment the number of clusters by 1, and then observe when the CCC starts to decrease. At that point I take the number of clusters at the (local) maximum. This would be similar to using a scree plot to picking the number of principal components.


SAS Technical Report A-108 Cubic Clustering Criterion (pdf)

$n$ = number of observations
$n_k$ = number in cluster $k$
$p$ = number of variables
$q$ = number of clusters
$X$ = $n\times p$ data matrix
$M$ = $q\times p$ matrix of cluster means
$Z$ = cluster indicator ($z_{ik}=1$ if obs. $i$ in cluster $k$, 0 otherwise)

Assume each variable has mean 0:
$Z’Z = \text{diag}(n_1, \cdots, n_q)$, $M = (Z’Z)-1Z’X$

$SS$(total) matrix = $T$= $X’X$
$SS$(between clusters) matrix = $B$ = $M’ Z’Z M$
$SS$(within clusters) matrix = $W$ = $T-B$

$R^2 = 1 – \frac{\text{trace(W)}}{\text{trace}(T)}$
(trace = sum of diagonal elements)

Stack columns of $X$ into one long column.
Regress on Kronecker product of $Z$ with $p\times p$ identity matrix
Compute $R^2$ for this regression – same $R^2$

The CCC idea is to compare the $R^2$ you get for a given set of clusters with the $R^2$ you would get by clustering a uniformly distributed set of points in $p$ dimensional space.

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
Ralph Winters
  • 801
  • 5
  • 7
  • 2
    There are other criteria besides CCC. Have a look at [Determining the number of clusters in a data set](http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set), to see the main ones. – Vincent Labatut Jun 04 '13 at 12:06