Looking for a metric to compare clustering solutions to a reference clustering for a large dataset

Question

I am looking for a metric to compare several clustering solutions to a reference clustering that is known to be "correct". Specifically, I have a set of millions of genes, and I wish to compare different clusterings of them against a known clustering of the same genes, in order to find the best-performing parameters and then apply those parameters to cluster a different set of genes.

If it were a smaller set, one option I thought of would be to check, for every pair of genes, whether the clustering solution puts them in the same cluster or in different clusters, and compare that to the reference clustering. This approach does not seem practical for the size of clustering I'm dealing with. I also assume there are well-established methods for such problems, but I was unable to find them.

What would be a good method/metric to compare a clustering solution to a given reference clustering?

Thanks!

More information about pair counting methods (most of which were mentioned by Anony-Mousse), can be found in this wikipedia section: https://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation (which I've previously missed). — dudusan, May 16 '16 at 18:22

score 1 · Accepted Answer · edited Apr 13 '17 at 12:44

See my answer (micans) to Comparing clusterings: Rand Index vs Variation of Information. In short, any reasonable solution will be exceedingly fast, and the standard solutions do indeed use pair counting, as stated by Anony-Mousse. However, pair-counting methods have a severe drawback: the resulting distances are strongly distorted by the sizes of the clusters involved (examples are given if you follow the link). Other methods exist that do not suffer from this drawback.
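As one illustration of a measure that is not based on pair counting, here is a minimal sketch of variation of information, the measure named in the title of the linked question. Treating it as an example of the "other methods" is an assumption, and the function name `variation_of_information` is illustrative rather than taken from any library. It needs only the cluster sizes and the joint (contingency) counts, so it runs in a single pass over the items.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of information VI = H(A|B) + H(B|A), in nats.

    labels_a and labels_b are equal-length sequences giving the cluster
    label of each item under the two clusterings. The result is 0 when
    the clusterings are identical up to relabeling, and grows as they
    diverge.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)

    # Contingency counts: how many items fall in cluster ca of A and cb of B.
    joint = Counter(zip(labels_a, labels_b))
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)

    vi = 0.0
    for (ca, cb), n_ab in joint.items():
        # -p_ab * [log(p_ab / p_a) + log(p_ab / p_b)]
        vi -= (n_ab / n) * (log(n_ab / size_a[ca]) + log(n_ab / size_b[cb]))
    return vi


if __name__ == "__main__":
    reference = [0, 0, 0, 1, 1, 1]
    candidate = [0, 0, 1, 1, 2, 2]
    print(variation_of_information(reference, candidate))
```

Lower values mean the candidate is closer to the reference, so when tuning parameters you would pick the setting with the smallest value.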

score 0 · Answer 2 · answered May 09 '16 at 06:08

The standard solutions use pair counting. It is not too expensive, and much cheaper than the clustering algorithm itself.

Examples include the Rand index, the adjusted Rand index (ARI), pair-counting F1, Fowlkes-Mallows, etc.
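A minimal sketch of why pair counting stays cheap even for millions of items: the four pair counts can be derived from the contingency table of the two labelings, so no explicit loop over all pairs is needed. The function names below are illustrative, not from a particular library, and the inputs are assumed to be two equal-length label sequences with at least two items.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Count item pairs by agreement, in one pass over the labels.

    Returns (both, only_a, only_b, neither):
      both    - pairs placed together in A and in B
      only_a  - pairs together in A but separated in B
      only_b  - pairs together in B but separated in A
      neither - pairs separated in both
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)

    joint = Counter(zip(labels_a, labels_b))  # contingency table cells
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    both = sum(comb(c, 2) for c in joint.values())

    total = comb(n, 2)
    return both, same_a - both, same_b - both, total - same_a - same_b + both


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    return (both + neither) / (both + only_a + only_b + neither)


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for chance agreement (Hubert-Arabie form)."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    total = both + only_a + only_b + neither
    same_a, same_b = both + only_a, both + only_b
    expected = same_a * same_b / total
    max_index = (same_a + same_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (both - expected) / (max_index - expected)


if __name__ == "__main__":
    reference = [0, 0, 0, 1, 1, 1]
    candidate = [0, 0, 1, 1, 2, 2]
    print(pair_counts(reference, candidate))
    print(rand_index(reference, candidate))
    print(adjusted_rand_index(reference, candidate))
```

The cost is linear in the number of items plus the number of non-empty contingency cells, which is why the comparison is much cheaper than the clustering itself.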

Has QUIT--Anony-Mousse

score 0 · Answer 3 · answered Mar 08 '18 at 21:33

I realise this post is from a while ago, but I recently found the R package dendextend, which has some functions that may help with this.

ramiro
