I'm using external validation to measure the success of a clustering algorithm. I don't consider my reference categories to be definitive, so I'm looking for a measure that is forgiving to the following extent:
- If two clusters are merged into one, this shouldn't be unduly penalised, as it is still a good match (though it should still be penalised to some extent, to avoid pushing the algorithm towards generating huge clusters)
- If one cluster is split into two, this shouldn't be unduly penalised either
- Suppose there are two reference clusters, A and B. Suppose half of A is placed in a cluster A2, half of B in a cluster B2, and the remaining halves of A and B are combined into a cluster C. This kind of alternative categorisation should be penalised more than the first two cases, but not unduly, as it could well represent another valid classification (see the sketch after this list)
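To make the scenarios concrete, here is a small sketch (Python, assuming scikit-learn is available) that encodes them as flat label vectors and scores them against the reference labels with two standard external measures, `adjusted_rand_score` and `v_measure_score`. These are only examples for comparison, not necessarily the measure I'm looking for:

```python
# Encode the three scenarios as flat label vectors so that candidate
# external measures can be compared on them.
from sklearn.metrics import adjusted_rand_score, v_measure_score

# Reference categorisation: two categories A and B, four items each.
truth = ["A"] * 4 + ["B"] * 4

# Scenario 1: A and B merged into a single cluster.
merged = ["AB"] * 8

# Scenario 2: A split into two clusters, B left intact.
split = ["A1", "A1", "A2", "A2"] + ["B"] * 4

# Scenario 3: half of A in A2, half of B in B2, the remaining halves in C.
crosscut = ["A2", "A2", "C", "C", "B2", "B2", "C", "C"]

for name, pred in [("merge", merged), ("split", split), ("cross-cut", crosscut)]:
    print(name,
          "ARI:", round(adjusted_rand_score(truth, pred), 3),
          "V-measure:", round(v_measure_score(truth, pred), 3))
```

Ideally the measure would penalise the first two scenarios only lightly and the third somewhat more, without rejecting it outright.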
What is a good measure for this kind of cluster validation?
This is related to my previous question.