I am fairly new to clustering data and to all the indexes and distances that exist. I have experimented with k-means, trying to pick the optimal number of clusters by hand, used hierarchical clustering as well, and lastly looked at the Bayesian Information Criterion (BIC).
From those experiences, choosing the number of clusters for k-means seems fairly arbitrary. We can use the elbow method, but it does not seem very repeatable, because the number of clusters it suggests might differ from run to run. The same goes for hierarchical clustering.
Since then I have found packages like NbClust and mclust that try to find the optimal number of clusters. NbClust looks at how frequently each number of clusters is proposed by its indices, given a distance and a method.
Although I can read and understand how the distance is calculated formally, I am curious whether anybody has good rules of thumb or guidelines for picking the distance and method. Also, should I really pick the number of clusters based on how frequently it is proposed, that is, follow the majority rule? Secondly, regarding Best.partition from NbClust: which clustering index is it based on? Surely the indices that agree under the majority rule cannot always place the same items in each cluster?
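To make the majority rule concrete, here is a minimal sketch of how I understand the tally to work. It is written in plain Python rather than R so that it is self-contained. The votes are made up for illustration and are not NbClust output, and `majority_k` is a hypothetical helper, not part of any package.

```python
from collections import Counter

# Hypothetical votes: index name -> number of clusters that index prefers.
# The numbers are invented for illustration only.
votes = {
    "kl": 3, "ch": 3, "hartigan": 4, "silhouette": 2,
    "gap": 3, "dunn": 2, "db": 4,
}

def majority_k(votes):
    """Return the most frequently proposed k and the full tally."""
    counts = Counter(votes.values())
    top = max(counts.values())
    # Ties are broken in favour of the smaller k (an arbitrary choice).
    best = min(k for k, c in counts.items() if c == top)
    return best, counts

best_k, counts = majority_k(votes)
print(best_k, dict(counts))
```

Note that the tally only agrees on the number of clusters. It says nothing about whether the indices that voted for that number would assign items to clusters in the same way, which is the point of my question about Best.partition.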
Most of the work I have read uses the BIC, which is understandable based on the answer provided here. Quoted below:
Long answer: The purpose of using model based clustering over heuristic based clustering approaches such as k-means and hierarchical (agglomerative) clustering is to provide a more formal and intuitive approach to comparing and selecting an appropriate cluster model for your data.
My question: should I use a process like an information criterion or NbClust? What are the pros and cons, or is it as simple as the answer quoted above?
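For reference, this is the criterion I mean, in its usual textbook form. The symbols are my own notation: $\hat{L}$ is the maximized likelihood of the fitted mixture model, $k$ the number of free parameters, and $n$ the number of observations.

$$
\mathrm{BIC} = -2 \ln \hat{L} + k \ln n
$$

In this form smaller is better. As far as I understand, mclust reports the negated version, $2 \ln \hat{L} - k \ln n$, so there the largest value wins. Either way, the $k \ln n$ term penalizes every additional component, so a model with more clusters is only chosen when the gain in likelihood outweighs that penalty.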
In applying the BIC method to my model, I found only 1 cluster, so the results were not as riveting as I expected. I therefore explored NbClust, but now I am stuck on which distance and method to use. How do others decide between BIC and NbClust, or do they simply pick one and not bother discussing the validity of their choice?