What is the optimal number of clusters?

Question

I am running agglomerative hierarchical clustering on my asymmetric binary data. To find the number of clusters, I tried the three most commonly mentioned methods (Elbow, Silhouette, and Gap statistic); however, their results do not agree, and the Elbow plot shows no clear bend (see the photo). Any suggestions on what an optimal k could be?

Since your data are binary or categorical, it is unwise to use clustering methods (what linkage did you use?) and cluster validity criteria that are based on centroids and deviations from them. So you had better dismiss Gap and Elbow. And since your Silhouette is low, you probably don't have any clusters at all. — ttnphns, Nov 05 '21 at 08:07
A pair of links for you to read. https://stats.stackexchange.com/q/195456/3277, https://stats.stackexchange.com/q/21807/3277 — ttnphns, Nov 05 '21 at 08:13
Thanks for your comment @ttnphns! I used Ward.D on a dissimilarity matrix computed with the Jaccard distance (my data are asymmetric binary, 0 = absence, 1 = presence, obtained by encoding my categorical data). I also consulted various sources and the literature, which supported this choice. — user339807, Nov 05 '21 at 08:37
No, Ward is not justified with proximities other than Euclidean distance. https://stats.stackexchange.com/q/195446/3277 — ttnphns, Nov 05 '21 at 08:40
@ttnphns, any suggestion (reference) for the linkage method for asymmetrical binary data? — user339807, Nov 05 '21 at 08:45
No, the proximity is more important than the linkage method. If you are determined that you need Jaccard, use, say, average linkage or complete linkage. — ttnphns, Nov 05 '21 at 08:46
(An overview of some 70 binary proximity measures with formulas can be found on my web page, collection "Various proximities" - macro ko_proxbin description.) — ttnphns, Nov 05 '21 at 08:50
@ttnphns, very useful source! I had a look, but was wondering if there are any suggested particular proximity measures for asymmetrical kind of binary data. I would be happy to hear your suggestion! — user339807, Nov 05 '21 at 09:19
In the document of mine there is a paragraph about "ordinal" (asymmetric) and "nominal" (symmetric) measures. Check also this answer https://stats.stackexchange.com/a/61910/3277 — ttnphns, Nov 05 '21 at 09:43
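
To make the advice in the comments concrete, here is a minimal Python sketch of Jaccard distances combined with average linkage (instead of Ward), with the silhouette computed directly from the distance matrix so that no centroids are involved. The toy matrix `X`, its three built-in groups, and the range of k are assumptions for illustration only, not the asker's data or code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

# Hypothetical toy data: 60 rows, 12 presence/absence columns,
# generated from three groups with different "presence" patterns.
rng = np.random.default_rng(0)
profiles = np.array([
    [0.9] * 4 + [0.1] * 8,
    [0.1] * 4 + [0.9] * 4 + [0.1] * 4,
    [0.1] * 8 + [0.9] * 4,
])
X = rng.random((60, 12)) < np.repeat(profiles, 20, axis=0)

# Jaccard ignores joint absences (0-0 matches), which suits asymmetric binary data.
d = pdist(X, metric="jaccard")    # condensed distance vector
Z = linkage(d, method="average")  # average linkage works with any dissimilarity
D = squareform(d)                 # square matrix for the silhouette

scores = {}
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")
    if len(np.unique(labels)) < 2:
        continue  # the tree could not be cut into at least two clusters
    # Silhouette on precomputed distances: no centroids are used.
    scores[k] = silhouette_score(D, labels, metric="precomputed")

for k, s in scores.items():
    print(f"k={k}: average silhouette = {s:.3f}")
```

A uniformly low average silhouette across all k would be consistent with the remark above that the data may not contain clear clusters at all.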

score 0 · Accepted Answer · answered Nov 05 '21 at 07:17

All those methods use different criteria, so it is not surprising that they disagree. The methods have already given you a few candidate values of k; it's up to you to choose among them.
Elbow does not give a clear answer here, and its interpretation is subjective.
In most cases, you shouldn't make the decision based on the metrics alone. Compare the solutions with two or three clusters: which one makes more sense for the data? Which one is easier to interpret?
If you want a less subjective criterion, you can use model-based clustering, where likelihood-ratio tests or criteria such as AIC can be used to distinguish between the solutions.
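
As a rough illustration of the model-based route for binary data, the sketch below fits a mixture of independent Bernoulli variables (a simple latent class model) by EM for several k and compares the fits by AIC, with BIC shown alongside. This is one possible model, not something prescribed above; the function names, the toy data, and the single random start are all assumptions made for the example. In practice several random starts per k would be advisable, since EM can stop at a local optimum.

```python
import numpy as np

def mixture_loglik(X, weights, theta):
    """Return per-row log-likelihoods and the (n, k) joint log-probabilities."""
    log_joint = (np.log(weights)
                 + X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T)
    return np.logaddexp.reduce(log_joint, axis=1), log_joint

def fit_latent_classes(X, k, n_iter=300, seed=0, eps=1e-6):
    """Fit a k-class Bernoulli mixture by EM; return (log-likelihood, n_params)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    weights = np.full(k, 1.0 / k)
    theta = rng.uniform(0.25, 0.75, size=(k, p))  # random starting profiles
    for _ in range(n_iter):
        # E-step: posterior class membership probabilities for each row.
        row_ll, log_joint = mixture_loglik(X, weights, theta)
        resp = np.exp(log_joint - row_ll[:, None])
        # M-step: update class sizes and per-class item probabilities.
        nk = resp.sum(axis=0)
        weights = np.clip(nk / n, eps, None)
        weights /= weights.sum()
        theta = np.clip(resp.T @ X / np.maximum(nk, eps)[:, None], eps, 1 - eps)
    loglik = mixture_loglik(X, weights, theta)[0].sum()
    n_params = (k - 1) + k * p  # free mixing weights + item probabilities
    return loglik, n_params

# Hypothetical toy data: three groups of 20 rows over 12 binary columns.
rng = np.random.default_rng(0)
profiles = np.array([
    [0.9] * 4 + [0.1] * 8,
    [0.1] * 4 + [0.9] * 4 + [0.1] * 4,
    [0.1] * 8 + [0.9] * 4,
])
X = (rng.random((60, 12)) < np.repeat(profiles, 20, axis=0)).astype(float)

aic, bic = {}, {}
for k in range(1, 6):
    loglik, n_params = fit_latent_classes(X, k)
    aic[k] = 2 * n_params - 2 * loglik
    bic[k] = n_params * np.log(len(X)) - 2 * loglik
    print(f"k={k}: logLik={loglik:.1f}  AIC={aic[k]:.1f}  BIC={bic[k]:.1f}")
```

Lower AIC or BIC indicates a better trade-off between fit and complexity; the k=1 row serves as a "no clusters" baseline against which the other solutions can be judged.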
