
I am doing a cluster analysis with agglomerative hierarchical clustering on my asymmetric binary data. To find the number of clusters, I tried the three most commonly mentioned methods (Elbow, Silhouette, and Gap statistic); however, the results do not agree, and in the case of Elbow the plot does not show a clear bend (see the photo). Any suggestions on what an optimal k could be?


  • Since your data are binary or categorical, it is unwise to use clustering methods (what linkage did you use?) and cluster validity criteria that are based on centroids and deviations from them. So you had better dismiss Gap and Elbow. As your low Silhouette suggests, you probably don't have any clusters at all. – ttnphns Nov 05 '21 at 08:07
  • A pair of links for you to read. https://stats.stackexchange.com/q/195456/3277, https://stats.stackexchange.com/q/21807/3277 – ttnphns Nov 05 '21 at 08:13
  • Thanks for your comment @ttnphns! I used Ward.D on the dissimilarity matrix calculated with the Jaccard distance (my data are asymmetric binary, 0 = absence and 1 = presence, obtained by encoding my categorical data). I relied on several sources and literature that supported this choice. – user339807 Nov 05 '21 at 08:37
  • No, you are not warranted to use Ward with proximities other than euclidean distance. https://stats.stackexchange.com/q/195446/3277 – ttnphns Nov 05 '21 at 08:40
  • @ttnphns, any suggestion (reference) for the linkage method for asymmetrical binary data? – user339807 Nov 05 '21 at 08:45
  • No, the proximity is more important than the linkage method. If you are determined that you need Jaccard, use, say, average linkage or complete linkage. – ttnphns Nov 05 '21 at 08:46
  • Very helpful! Thanks a lot @ttnphns – user339807 Nov 05 '21 at 08:47
  • (An overview of some 70 binary proximity measures with formulas can be found on my web page, collection "Various proximities", in the description of the macro ko_proxbin.) – ttnphns Nov 05 '21 at 08:50
  • @ttnphns, very useful source! I had a look, but was wondering if there are any suggested particular proximity measures for asymmetrical kind of binary data. I would be happy to hear your suggestion! – user339807 Nov 05 '21 at 09:19
  • In the document of mine there is a paragraph about "ordinal" (asymmetric) and "nominal" (symmetric) measures. Check also this answer https://stats.stackexchange.com/a/61910/3277 – ttnphns Nov 05 '21 at 09:43
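
A minimal sketch of the workflow suggested in the comments above: Jaccard dissimilarities, average linkage instead of Ward, and a silhouette computed from the dissimilarity matrix itself rather than from centroids. The thread does not say which software was used beyond "Ward.D", so the choice of Python with SciPy and scikit-learn is an assumption, and the toy presence/absence matrix `X` is hypothetical illustration data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical presence/absence data: two groups of 30 rows, 12 binary features.
# Group a tends to have features 0-5 present, group b features 6-11.
a = rng.random((30, 12)) < np.r_[np.full(6, 0.8), np.full(6, 0.05)]
b = rng.random((30, 12)) < np.r_[np.full(6, 0.05), np.full(6, 0.8)]
a[:, 0] = True  # guarantee no all-zero rows, where Jaccard is ill-defined
b[:, 6] = True
X = np.vstack([a, b])

# Jaccard ignores joint absences (0-0 matches), which suits asymmetric binary data.
d = pdist(X, metric="jaccard")    # condensed dissimilarity vector
Z = linkage(d, method="average")  # average linkage, not Ward
D = squareform(d)                 # square matrix for the silhouette

# Silhouette on the precomputed dissimilarities, for k = 2..6.
scores = {}
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")
    if len(set(labels)) < 2:  # silhouette needs at least two clusters
        continue
    scores[k] = silhouette_score(D, labels, metric="precomputed")

best_k = max(scores, key=scores.get)
for k, s in sorted(scores.items()):
    print(f"k={k}: mean silhouette = {s:.3f}")
```

If the mean silhouette stays low for every k, that is consistent with the remark above that the data may not contain clear clusters at all.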

1 Answer

  • All of these methods use different criteria, so it is not surprising that they disagree. The methods already give you some candidate choices; it's up to you to make the decision.
  • Elbow does not give a clear answer, and its interpretation is subjective.
  • In most cases, you shouldn't base the decision on the metrics alone. Compare the solutions with two or three clusters: which one makes more sense for the data? Which one is easier to interpret?
  • If you want a less subjective criterion, you can use model-based clustering, where likelihood-ratio tests or criteria such as AIC can be used to distinguish between the solutions.
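
To illustrate the last point, here is a rough sketch of model-based clustering for binary data. The answer does not name a specific model; a mixture of independent Bernoulli variables (a simple latent class model) fitted by EM is assumed here as one common choice. The function name `fit_bernoulli_mixture` and the toy data are hypothetical, and this is not a tuned implementation (a single random start, fixed number of iterations).

```python
import numpy as np


def fit_bernoulli_mixture(X, k, n_iter=200, seed=0):
    """Fit a k-component mixture of independent Bernoullis by EM.

    Returns (log-likelihood, AIC). Assumed model, single random start.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    rng = np.random.default_rng(seed)
    weights = np.full(k, 1.0 / k)
    theta = rng.uniform(0.25, 0.75, size=(k, p))  # P(feature = 1 | cluster)

    def log_joint():
        # log P(cluster) + log P(row | cluster), shape (n, k)
        return np.log(weights) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T

    for _ in range(n_iter):
        # E-step: posterior cluster probabilities for every row.
        lj = log_joint()
        resp = np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))
        # M-step: update mixing weights and per-cluster feature probabilities.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        theta = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)

    loglik = float(np.logaddexp.reduce(log_joint(), axis=1).sum())
    n_params = (k - 1) + k * p  # free mixing weights + Bernoulli probabilities
    return loglik, 2 * n_params - 2 * loglik


rng = np.random.default_rng(1)
# Hypothetical presence/absence data with two underlying groups.
a = rng.random((30, 12)) < np.r_[np.full(6, 0.8), np.full(6, 0.05)]
b = rng.random((30, 12)) < np.r_[np.full(6, 0.05), np.full(6, 0.8)]
X = np.vstack([a, b]).astype(float)

aic = {k: fit_bernoulli_mixture(X, k)[1] for k in range(1, 5)}
for k, value in aic.items():
    print(f"k={k}: AIC = {value:.1f}")
```

The solution with the lowest AIC is the one this criterion prefers. In practice EM should be restarted from several initial values, and the preferred solution should still be checked for interpretability, as the earlier points suggest.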
Tim