Optimal number of clusters in gene expression data

Asked Feb 28 '22 at 10:44

I'm clustering genes based on gene expression data. Here's a hierarchically clustered heatmap using Ward linkage and Euclidean distance:

It clearly shows 5 or 6 clusters. But when I compute the silhouette score on labels obtained from scipy's fcluster, I get a decreasing curve like this:

The Davies-Bouldin (DB) scores, meanwhile, increase, although there are slight dips from 4 to 5, 7 to 8, and 9 to 10 clusters:
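For reference, here is a minimal sketch of how both indices could be computed over a range of cluster counts. It is not my exact code: the data is synthetic, the helper name score_cuts is hypothetical, and it assumes the tree is cut with fcluster's "maxclust" criterion and scored with scikit-learn's silhouette_score and davies_bouldin_score.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score, davies_bouldin_score


def score_cuts(X, k_values):
    """Cut one Ward/Euclidean tree at each k and score the resulting labels.

    Hypothetical helper: returns {k: (silhouette, davies_bouldin)}.
    """
    # Build the hierarchy once; Ward linkage requires Euclidean distance.
    Z = linkage(X, method="ward", metric="euclidean")
    results = {}
    for k in k_values:
        # "maxclust" cuts the tree so that at most k flat clusters remain.
        labels = fcluster(Z, t=k, criterion="maxclust")
        sil = silhouette_score(X, labels, metric="euclidean")  # higher is better
        db = davies_bouldin_score(X, labels)                   # lower is better
        results[k] = (sil, db)
    return results


# Synthetic stand-in for an expression matrix (rows = genes, columns = samples).
# Assumption: 5 well-separated groups, which real expression data may not have.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(5, 20))
X = np.vstack([c + rng.normal(size=(40, 20)) for c in centers])

scores = score_cuts(X, range(2, 11))
for k, (sil, db) in scores.items():
    print(f"k={k:2d}  silhouette={sil:.3f}  davies_bouldin={db:.3f}")
```

Note the opposite directions of the two indices: a higher silhouette is better, while a lower Davies-Bouldin score is better.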

My question is: should I take this curve as "proof" that 5 or 8 clusters are better, even though the plot shows they are only better relative to their neighbors? Or should I conclude that 2 clusters are best, even though the heatmap shows otherwise? Why doesn't the heatmap translate into good scores on both indices?

Shubham Agrawal

The question is whether these clusters will validate on new data. I have my doubts. But on a practical level the number of clusters is sometimes taken as the dimensionality with which you can relate the results to an outcome. At least when using clustering as unsupervised learning (data reduction). – Frank Harrell Feb 28 '22 at 12:25
Please read this cautionary answer carefully: https://stats.stackexchange.com/a/63549/3277 – ttnphns Mar 01 '22 at 22:17
