
I know some methods exist for choosing k for k-means or k-medoids, but they don't seem rigorous enough when I only care about k itself, not about what ends up in each cluster. Is there a rigorous process or algorithm for estimating the number of clusters in my data?

1 Answer


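Yes, and it's a very well-developed field. The approach for estimating the optimal number of clusters in a data set is called "cluster validity." The usual recipe is to cluster the data for a range of candidate k values, score each resulting partition with a validity index that rewards compact, well-separated clusters, and keep the k with the best score.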

See:

N. Speer, C. Spieth, and A. Zell. Biological cluster validity indices based on the gene ontology. Lecture Notes in Computer Science, 3646:429--439, 2005.

D. Davies and D. Bouldin. A cluster separation measure. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1(2):224--227, 1979.

J. Dunn. Well-separated clusters and optimal fuzzy partitions. J. Cybernetics, 4:95--104, 1974.

P. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53--65, 1987.

M. Gonzalez Toledo. A comparison in cluster validation techniques. Master's thesis, University of Puerto Rico - Mayaguez Campus, 2005.

N. Bolshakova and F. Azuaje. Cluster validation techniques for genome expression data. Signal Process., 83(4):825--833, 2003.
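
To make the idea concrete, here is a minimal sketch of how one validity index from the list above (Rousseeuw's silhouette) might be used to pick k. Everything in it is an assumption for illustration, not something from the papers: the toy data (three synthetic 2-D blobs), the plain k-means with farthest-first initialization, and the helper names `farthest_first_centers`, `kmeans_labels`, and `mean_silhouette`. It only uses the standard library and needs Python 3.8+ for `math.dist`.

```python
import math
import random


def farthest_first_centers(points, k):
    """Deterministic init (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans_labels(points, k, max_iter=100):
    """Plain Lloyd-style k-means; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                updated.append(centers[c])  # keep the old center if a cluster empties
        if updated == centers:
            break
        centers = updated
    return labels


def mean_silhouette(points, labels):
    """Average silhouette width: s = (b - a) / max(a, b) per point, where
    a is the mean distance to the point's own cluster and b is the mean
    distance to the nearest other cluster. Singletons contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0  # the index is not defined for a single cluster
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


# Toy data (made up for this example): three well-separated Gaussian blobs.
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
    for cx, cy in blob_centers
    for _ in range(30)
]

# Score each candidate k and keep the one with the highest mean silhouette.
scores = {k: mean_silhouette(points, kmeans_labels(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"k={k}: mean silhouette = {s:.3f}")
print("estimated number of clusters:", best_k)
```

The other indices (Davies-Bouldin, Dunn) slot into the same loop by swapping out the scoring function, keeping in mind that Davies-Bouldin is minimized rather than maximized. Note that the silhouette can't score k = 1, so this kind of procedure only compares k >= 2.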