Methods for evaluation of clustering

Question

I have a labeled data set (with only 2 classes), and I'm trying different clustering algorithms with different similarity measures (each of which produces a different distance matrix that I pass as input to the clustering algorithms).

To select the best pair of clustering algorithm and similarity function, I'm comparing the labels with the clustering results. I read about the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Adjusted Mutual Information (AMI). ARI makes sense to me; however, I'm not sure what the other two are actually doing.

Can you please explain the motivation behind the three? What are the differences?
Which one is recommended to use in which scenario?

One metric that isn't leveraged often enough wrt cluster solutions is out-of-sample fit. In other words all of the metrics mentioned are based on *calibration* information and, in that sense, are optimistically biased. That said, Lachenbruch was perhaps the first to suggest leave-one-out cross-validation for use in linear discriminant analysis. There's no reason LOOCV can't be generalized to cluster comparison as well. — Mike Hunter, Apr 18 '18 at 13:57
Related: [How to select a clustering method? How to validate a cluster solution (to warrant the method choice)?](https://stats.stackexchange.com/q/195456/) — gung - Reinstate Monica, Apr 18 '18 at 21:19

score 0 · Answer 1 · answered Apr 18 '18 at 19:54

These measures try to quantify the agreement between two clusterings using Shannon information.

Roughly: how many bits of information about the second clustering do you learn from every bit of information you get about the first clustering?
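To make the "bits of information" idea concrete, here is a minimal sketch in plain Python that computes entropy, mutual information, and one common form of NMI for two labelings. The function names and the toy labelings are assumptions for illustration, and the arithmetic-mean normalization used in `nmi` is only one of several conventions in use.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in bits) between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    # Sum over non-empty cells of the contingency table:
    # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
    return sum(
        (c / n) * log2(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """MI divided by the mean of the two entropies (one possible normalization)."""
    ha, hb = entropy(a), entropy(b)
    if ha + hb == 0:
        # Both labelings put everything in one group: treat as identical.
        return 1.0
    return 2 * mutual_information(a, b) / (ha + hb)


# Hypothetical toy data: 8 items, 2 true classes, one item clustered "wrongly".
truth = [0, 0, 0, 0, 1, 1, 1, 1]
clusters = [0, 0, 0, 1, 1, 1, 1, 1]

print("H(truth)     =", entropy(truth))
print("MI           =", mutual_information(truth, clusters))
print("NMI          =", nmi(truth, clusters))
```

A labeling compared with itself shares all of its information (NMI of 1), while two statistically independent labelings share none (MI of 0).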

Has QUIT--Anony-Mousse

Thanks. I still don't completely understand the difference between the three. It sounds to me like all three have the same goal (measuring how similar two clustering solutions are), but each is calculated a bit differently, so the final conclusions might differ. Meaning, I should just pick one to use when comparing. But which one is the most common? – Userrrrrrrr Apr 19 '18 at 06:18
Yes, same goal. But NMI uses Shannon information, Rand doesn't. So make sure you have understood information. Adjustment is just a rescaling to get 0 for random results. – Has QUIT--Anony-Mousse Apr 19 '18 at 06:27
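The "rescaling to get 0 for random results" has the general form (index − expected index) / (maximum index − expected index). As a rough sketch, assuming the usual pair-counting definition of the Rand index and its chance-corrected version, the following plain-Python code compares both on a random labeling. The function names, data sizes, and seed are illustrative assumptions.

```python
from collections import Counter
from math import comb
import random


def rand_index(a, b):
    """Fraction of item pairs on which the two labelings agree."""
    pairs = comb(len(a), 2)
    same_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    same_a = sum(comb(c, 2) for c in Counter(a).values())
    same_b = sum(comb(c, 2) for c in Counter(b).values())
    # Pairs separated in both labelings, by inclusion-exclusion.
    apart_both = pairs - same_a - same_b + same_both
    return (same_both + apart_both) / pairs


def adjusted_rand_index(a, b):
    """(index - expected) / (max - expected), using pair counts."""
    pairs = comb(len(a), 2)
    index = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    same_a = sum(comb(c, 2) for c in Counter(a).values())
    same_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = same_a * same_b / pairs
    max_index = (same_a + same_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. both labelings have a single cluster).
        return 1.0
    return (index - expected) / (max_index - expected)


# Hypothetical data: 1000 items in 2 balanced classes, and a coin-flip "clustering".
rng = random.Random(0)
truth = [0] * 500 + [1] * 500
guess = [rng.randint(0, 1) for _ in truth]

print("Rand index, random guess:", rand_index(truth, guess))
print("ARI, random guess:       ", adjusted_rand_index(truth, guess))
print("ARI, perfect match:      ", adjusted_rand_index(truth, truth))
```

With two balanced classes, the unadjusted Rand index of a coin-flip labeling sits near 0.5, whereas the adjusted version sits near 0, which is what makes adjusted scores easier to compare across settings.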

score 0 · Accepted Answer · answered Apr 18 '18 at 21:15

You can also evaluate the clustering procedure with the Silhouette Coefficient. Note that it is an internal measure: it is computed only from the distances and the cluster assignments, so it does not require the labels. The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. The silhouette can be calculated with any distance metric.
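As an illustration of how the silhouette is computed from a precomputed distance matrix and a set of cluster assignments, here is a minimal plain-Python sketch. The function name, the convention of scoring singleton clusters as 0, and the toy one-dimensional data are assumptions; it expects at least two clusters.

```python
def silhouette_scores(dist, assignments):
    """Per-item silhouette values from a square distance matrix.

    dist[i][j] is the distance between items i and j;
    assignments[i] is the cluster of item i (at least two clusters expected).
    """
    clusters = {}
    for i, lab in enumerate(assignments):
        clusters.setdefault(lab, []).append(i)

    scores = []
    for i, lab in enumerate(assignments):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the item's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical toy data: four points on a line, absolute difference as distance.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

good = silhouette_scores(dist, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette_scores(dist, [0, 1, 0, 1])   # mixes distant points

print("mean silhouette, good clustering:", sum(good) / len(good))
print("mean silhouette, bad clustering: ", sum(bad) / len(bad))
```

Because only `dist` and `assignments` are inputs, the same function can be reused for each distance matrix and algorithm pair being compared.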

Christos Karatsalos

Thanks. What about the metrics I mentioned, when calculated against the labels of course (how similar is my clustering to the ground-truth labeling)? Are those good measures too? – Userrrrrrrr Apr 19 '18 at 06:15
The Adjusted Rand Index measures the similarity between two clusterings, and it is applicable even when class labels are not used. In the case you describe, it is meaningful if you compare the clustering obtained from an algorithm such as k-means with the partition defined by the true labels. Mutual Information measures the agreement of the two clusterings, ignoring permutations. Adjusted Mutual Information is normalized against chance. Although ARI and AMI are both similarity measures corrected for chance, they are not equivalent, since they are based on different theory. – Christos Karatsalos Apr 19 '18 at 20:09
Thanks, this is exactly what I'm doing. Regarding "Mutual Information measures the agreement of the two clusterings, ignoring permutations": what do you mean by ignoring permutations? Can you explain a bit? Thanks – Userrrrrrrr Apr 21 '18 at 12:04
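One common reading of "ignoring permutations" is that the score does not depend on which names the clusters happen to get: calling the clusters (1, 0) instead of (0, 1) describes the same partition. The sketch below, with illustrative function names and toy data, contrasts this with naive label accuracy, which does depend on the names.

```python
from collections import Counter
from math import log2


def mutual_information(a, b):
    """Mutual information (in bits) between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log2(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def accuracy(a, b):
    """Fraction of items whose label names match exactly."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical toy data: the clustering recovers the classes perfectly,
# but the algorithm happened to name the clusters the other way round.
truth = [0, 0, 0, 1, 1, 1]
clusters = [1, 1, 1, 0, 0, 0]
renamed = [1 - c for c in clusters]  # same partition, cluster names swapped

print("accuracy, original names:", accuracy(truth, clusters))
print("accuracy, swapped names: ", accuracy(truth, renamed))
print("MI, original names:      ", mutual_information(truth, clusters))
print("MI, swapped names:       ", mutual_information(truth, renamed))
```

Accuracy flips between its extremes when the names are swapped, while mutual information is unchanged, so no matching of cluster names to class names is needed before comparing.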
