
So I am a newbie to K-means. I used several methods to identify the number of clusters, but each method gives a different output: 2 clusters from the silhouette, 8 from the gap statistic, and 4 from NbClust. So which is the right answer? I'm so confused.

  • 2
    Different internal clustering criteria often give non-unanimous suggestions. The only situation in which they all agree is the rare (in real life) case when clusters are very clear and well separated. Here is some info on such criteria: https://stats.stackexchange.com/a/358937/3277. – ttnphns Dec 01 '21 at 11:06

2 Answers

1

The number of clusters problem is very difficult, and many datasets allow for clusterings with different numbers of clusters that are pretty much equally legitimate. It is quite common that these different approaches give you different results.

The idea that there is just one unique true number of clusters is wrong in general. In a real application, the decision about the number of clusters should take into account the meaning of the data and particularly how the clustering will be used. Sometimes it is essential to have a very small number of strongly different clusters; sometimes elements of the same cluster need to be very similar to each other, which requires a larger number of clusters. Sometimes clusters are interpreted as pointing to essentially different conditions that are meant to be found, whereas sometimes clustering is just used for organising the data into practically accessible homogeneous groups, which can make sense even if there is no underlying essential difference between them.

The outcomes of the methods you used can only ever give a rough orientation; no automatic method can be trusted to find a uniquely correct number, because such a number does not exist.

A problem with the silhouette index is that it can attain its maximum at too low a number of clusters (often 2, though in principle also 3 or 4) when more clusters would in fact make sense, but those clusters are located in data space in such a way that they can be grouped into 2 (or 3 or 4) strongly separated "superclusters" (clusters of clusters, if you want). The same effect can happen with very small clusters of outliers. For this reason it is worthwhile to explore local optima of the silhouette index that occur at higher numbers of clusters.
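A quick way to explore this is to scan the silhouette index over a range of k and record every local optimum rather than only the global one. Here is a minimal sketch in Python with scikit-learn; the synthetic data, the k range, and the simple neighbour comparison are illustrative assumptions, not part of the original discussion:

```python
# Sketch: scan the silhouette index over k and report local optima,
# not just the global maximum. Data and parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=6, cluster_std=1.0, random_state=0)

ks = range(2, 11)
scores = {}
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A local optimum is a k whose score is at least as good as both neighbours.
local_optima = [k for k in ks
                if (k - 1 not in scores or scores[k] >= scores[k - 1])
                and (k + 1 not in scores or scores[k] >= scores[k + 1])]
print(scores)
print("candidate k values worth inspecting:", local_optima)
```

Each k in `local_optima` is then a candidate to inspect visually, not an automatic answer.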

I'm not very keen on using NbClust to make the decision, because it basically averages over different indexes that may do things that are not very compatible if you try to understand them in detail. I'd rather use indexes for which it is clearer why they prefer certain clusterings (although the NbClust function can be useful for looking at several of these). The problem with the gap statistic is that, while the underlying idea is valid, the result ultimately depends on a number of tuning decisions that are hard to make and for which there is little guidance. One can obviously use the default choices of its function, but those are just one of several seemingly legitimate ways of running it; see its help page.
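To see how many tuning decisions the gap statistic involves, here is a minimal sketch (in Python rather than R's clusGap) where the choices are made explicit: the number of reference datasets B and the reference distribution (here, uniform over the data's bounding box) are exactly the kind of knobs the answer refers to; all values are illustrative:

```python
# Minimal gap-statistic sketch. B (number of reference datasets) and the
# reference distribution (uniform over the bounding box) are tuning
# choices, not givens; changing them can change the suggested k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

def log_wk(data, k):
    """Log of the within-cluster dispersion for a k-means fit."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    return np.log(km.inertia_)

B = 10  # tuning choice: more reference sets -> more stable estimate
lo, hi = X.min(axis=0), X.max(axis=0)
gaps = {}
for k in range(1, 9):
    # gap(k) = E[log W_k(reference)] - log W_k(data)
    ref = np.mean([log_wk(rng.uniform(lo, hi, size=X.shape), k)
                   for _ in range(B)])
    gaps[k] = ref - log_wk(X, k)
print(gaps)
```

Even this stripped-down version leaves open how to pick k from the gap curve (first local maximum, one-standard-error rule, etc.), which is another of those decisions.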

So take into account the aim of the clustering, and visualise the clusterings that these methods rate as "best" (and maybe some others that also look like good candidates). Then you may be able to make a convincing decision, or instead conclude that the data support several clustering structures that may be equally valid. (There are even more methods, for example based on stability, that may give you yet more different results; I guess you may not be happy about this...)

Christian Hennig
1

As Christian has mentioned already, there is no free lunch when determining the "best" number of clusters, and much research with different approaches has been done on this. If what you are looking for is just a citable index to back up your decision on the number of clusters (e.g., for a publication), then you should be fine using any index as long as:

  1. You and your collaborators understand how it works
  2. The cited work describing the index provides all the arguments necessary to demonstrate that the index is valuable for evaluating a clustering partition, and therefore for choosing the number of clusters when grading different clusterings (here we can also add that the method should preferably be published in a credible, peer-reviewed source)
  3. Using the index for your particular application makes sense (e.g., using a validation index conceived for sphere-like or globular clusters may not be the best choice with a density-based algorithm, where you expect your data to make sense when clustered by density).

If you want to pick a method based on experimental evidence that it performs better than others, you can always look at papers dedicated to comparing methods and try out the best ones on your data to see whether they produce insightful results. For example, a relatively recent method called Validation Index using supervised Classifiers (VIC) was introduced in a paper where the authors showed that, compared to other indices, VIC gave better rankings to clusterings whose number of clusters matched the original classes of a collection of datasets. For details see:

J. Rodríguez, M. A. Medina-Pérez, A. E. Gutierrez-Rodríguez, R. Monroy, H. Terashima-Marín (2018). Cluster validation using an ensemble of supervised classifiers. Knowledge-Based Systems, 145, pp. 134–144. https://doi.org/10.1016/j.knosys.2018.01.010

MikeKatz45