I know some methods exist for choosing k for k-means or k-medoids, but they don't seem rigorous enough when I care only about k itself, not about what ends up in the clusters. Is there a rigorous process or algorithm for estimating the number of clusters in my data?
Check this introduction, too: https://stats.stackexchange.com/a/358937/3277. – ttnphns Feb 29 '20 at 08:32
1 Answer
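Yes, and it's a very well-developed field. The approach for estimating the optimal number of clusters in a data set is called "cluster validity": a validity index scores the partition obtained for each candidate k, and you pick the k whose partition scores best.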
See:
N. Speer, C. Spieth, and A. Zell. Biological cluster validity indices based on the gene ontology. Lecture Notes in Computer Science, 3646:429–439, 2005.
D. Davies and D. Bouldin. A cluster separation measure. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1(2):224–227, 1979.
J. Dunn. Well separated clusters and optimal fuzzy partitions. J. Cybernetics, 4:95–104, 1974.
P. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
M. Gonzalez Toledo. A comparison in cluster validation techniques. Master's thesis, University of Puerto Rico - Mayaguez Campus, 2005.
N. Bolshakova and F. Azuaje. Cluster validation techniques for genome expression data. Signal Process., 83(4):825–833, 2003.
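
To make the idea concrete, here is a minimal sketch of one such index, the silhouette width from the Rousseeuw paper above, used to pick k. Everything else in it is an assumption made for illustration: the toy data (three synthetic 2-D blobs), the small deterministic k-means helper, and the candidate range k = 2..6. It uses only the Python standard library and is not meant as a reference implementation.

```python
import math
import random

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with farthest-first initialisation (deterministic)."""
    centers = [points[0]]
    while len(centers) < k:
        # next center: the point farthest from all centers chosen so far
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # update step: mean of each cluster (keep the old center if a cluster is empty)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Mean silhouette width; requires at least two non-empty clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes s(i) = 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy data (assumption): three well-separated Gaussian blobs, 30 points each.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in blob_centers for _ in range(30)]

# Score each candidate k and keep the one with the largest mean silhouette.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The same loop works with any of the other indices in the list (for Davies-Bouldin you would take the minimum rather than the maximum). In practice you would likely use a library routine instead of hand-rolled code, for example scikit-learn's `silhouette_score` and `davies_bouldin_score`.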