The adjusted Rand index could work. It's a popular method for measuring the similarity of two ways of assigning discrete labels to the same data, ignoring permutations of the labels themselves. Instead of checking whether the raw class/cluster labels match, you'd look at pairs of points and ask: to what extent are pairs in the same class assigned to the same cluster, and pairs in different classes assigned to different clusters?
To compute the Rand index, you'd measure:
- $a$ = Number of pairs that have the same class label and same cluster assignment
- $b$ = Number of pairs that have different class labels and different cluster assignments
The raw Rand index is:
$$RI = \frac{a + b}{\binom{n}{2}}$$
where $\binom{n}{2}$ is the number of possible pairs of points. $RI$ ranges from 0 to 1, with 1 indicating total agreement.
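The pairwise definition above can be sketched directly in Python (a minimal illustration, not an optimized implementation; the example labelings are made up):

```python
from itertools import combinations

def rand_index(classes, clusters):
    """Raw Rand index: fraction of point pairs on which the two
    labelings agree (grouped together in both, or separated in both)."""
    a = b = 0
    n_pairs = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            a += 1          # agreement: pair together in both labelings
        elif not same_class and not same_cluster:
            b += 1          # agreement: pair separated in both labelings
        n_pairs += 1        # n_pairs ends up equal to C(n, 2)
    return (a + b) / n_pairs

# Identical partitions up to a label permutation score 1.0:
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Note that swapping the cluster labels has no effect, since only the pair structure matters.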
However, the raw Rand index has a drawback: its expected value under a random assignment of labels is not zero (and depends on the number and sizes of the clusters), so even chance-level labelings can score well above 0. The adjusted Rand index (ARI) corrects for this by subtracting the expected index under random labeling, which makes this type of null result easy to spot. ARI is bounded above by 1 and can be negative: near-zero and negative values indicate chance-level labelings, positive values indicate similar labelings, and 1 indicates perfect agreement.
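In practice you'd rarely compute this by hand; a sketch using scikit-learn (assuming it's installed; the example labelings are made up):

```python
from sklearn.metrics import rand_score, adjusted_rand_score

classes  = [0, 0, 0, 1, 1, 1]
clusters = [1, 1, 0, 0, 2, 2]

print(rand_score(classes, clusters))           # raw RI: optimistic under chance
print(adjusted_rand_score(classes, clusters))  # ARI: near 0 for chance-level labelings

# Permuting the cluster labels (1->0, 0->2, 2->1) changes neither score,
# since both metrics depend only on the pair structure of the partition:
relabeled = [0, 0, 2, 2, 1, 1]
print(adjusted_rand_score(classes, relabeled)
      == adjusted_rand_score(classes, clusters))  # True
```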
You can also take a look at other clustering performance metrics (e.g. in scikit-learn's clustering evaluation guide). The metrics that might be useful to you are the ones that compare cluster assignments to ground-truth labels (i.e. your class labels): normalized/adjusted mutual information; homogeneity, completeness, and V-measure; and the Fowlkes-Mallows score.
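All of these ground-truth-based metrics share the same call signature in scikit-learn, so trying them side by side is cheap (again with made-up labelings):

```python
from sklearn.metrics import (
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
    homogeneity_completeness_v_measure,
    fowlkes_mallows_score,
)

classes  = [0, 0, 0, 1, 1, 2]
clusters = [0, 0, 1, 1, 2, 2]

print(normalized_mutual_info_score(classes, clusters))
print(adjusted_mutual_info_score(classes, clusters))       # chance-corrected, like ARI
print(homogeneity_completeness_v_measure(classes, clusters))  # returns a 3-tuple
print(fowlkes_mallows_score(classes, clusters))
```

Like ARI, the adjusted mutual information is chance-corrected, which matters if you're comparing labelings with different numbers of clusters.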