Validation of clustering results

Question

I have a data set with several columns, which I reduced to two components using PCA. I then applied the k-means algorithm to the reduced data.
Now, how can I verify that my data clustered well into each group? Or how do I determine the misclassification rate?

For instance, in R, if I check the cluster vector, say k$cluster, against the labels the data had before clustering, can I just build a confusion matrix from the two and assume that 1 in the cluster vector is equivalent to 1 in my labels?

col3    col2    Col1    labels
123     2.32    2.50    0
124     2.81    3.10    1
125     2.72    3.09    2
126     2.92    3.03    3
127     2.32    2.95    4

Please note that this is hypothetical data; my real data set is much larger than this.

You speak of assessing "misclassification". Do you mean that you had some classification of the observations prior to clustering, and now you want to compare that classification with the one given by the clustering? – ttnphns, Sep 15 '11 at 06:08
Are you sure the cluster labels correspond to each other, i.e. that cluster A is cluster A in both clusterings? And do you have an equal number of clusters in the first place? – , Sep 15 '11 at 11:19
There is a prior classification. One of the things I am confused about is whether I can safely assume that cluster 1 in the vector generated by the clustering corresponds to my label 1. Or how can I tell whether the clustering corresponds at all to the labels I had prior to clustering? – persistence911, Sep 15 '11 at 11:26
For a real example of scrambled clusters, see [how-to-calculate-classification-error-rate](http://stackoverflow.com/questions/10067118/how-to-calculate-classification-error-rate) on SO. – denis, Apr 19 '12 at 15:00

score 3 · Answer 1 · answered Sep 15 '11 at 14:05

If you have an a priori classification into groups, you should not rely on labels being identical between the a priori classification and the clustering you obtained. I would start by computing the distance between the two clusterings (treating the classification as a clustering) using a metric distance between clusterings. Such metrics can typically be derived from the confusion matrix alone, and hence do not depend on the labels beyond their indicating which objects are grouped together within a single clustering.

I usually recommend "Comparing clusterings by the variation of information" by Marina Meila. It discusses three metrics: the variation of information, which is the main contribution of the paper (and very good); the Mirkin distance (related to the Jaccard index and well known, but not so good, as it is affected quadratically by cluster sizes); and the split/join distance (Meila calls it the 'van Dongen' distance). Disclaimer: the last one was developed by me. It has the advantage that it is interpretable as the number of nodes that need reallocation to change one clustering or classification into the other.

There are many other clustering (dis)similarity measures, but I would only recommend the variation of information and the split/join distance; although they are popular, I would not recommend the Jaccard/Mirkin measures.
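As a minimal sketch of how two of these measures can be computed from the confusion matrix alone, the Python code below counts the objects shared by each pair of clusters and derives the variation of information and the split/join distance from those counts. It is not taken from the paper: the function names are hypothetical, the variation of information is reported in nats, and the split/join distance is assumed to take the form 2n - Σ_i max_j n_ij - Σ_j max_i n_ij.

```python
from collections import Counter
from math import log


def contingency(a, b):
    """Count the objects falling in each (cluster of a, cluster of b) pair."""
    if len(a) != len(b):
        raise ValueError("both labelings must cover the same objects")
    return Counter(zip(a, b))


def variation_of_information(a, b):
    """VI = H(a|b) + H(b|a), in nats; 0 means the partitions are identical."""
    n = len(a)
    joint = contingency(a, b)
    size_a, size_b = Counter(a), Counter(b)
    vi = 0.0
    for (i, j), n_ij in joint.items():
        # -p_ij * [log(p_ij / p_i) + log(p_ij / p_j)]
        vi -= (n_ij / n) * (log(n_ij / size_a[i]) + log(n_ij / size_b[j]))
    return vi


def split_join_distance(a, b):
    """2n minus the best overlaps, summed over clusters of a and of b."""
    n = len(a)
    joint = contingency(a, b)
    best_in_a, best_in_b = {}, {}
    for (i, j), n_ij in joint.items():
        best_in_a[i] = max(best_in_a.get(i, 0), n_ij)
        best_in_b[j] = max(best_in_b.get(j, 0), n_ij)
    return 2 * n - sum(best_in_a.values()) - sum(best_in_b.values())


# Hypothetical data: the same grouping, but with the cluster names permuted.
labels = [0, 0, 1, 1, 2, 2]
clusters = [2, 2, 0, 0, 1, 1]
print(variation_of_information(labels, clusters))  # 0.0
print(split_join_distance(labels, clusters))       # 0
```

Both distances are zero here even though no cluster number matches its label number, which is why comparing `k$cluster` with the labels entry by entry can be misleading.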

micans

Thanks for the response. From my understanding, the confusion matrix needs to know which labels map to each other in the two clusterings. But you state: "All such metrics can typically be derived from the confusion matrix only, and hence do not depend on labels beyond their indicating commonality of grouping within a single clustering". Please can you explain what you mean by this? – persistence911 Sep 15 '11 at 14:18
There needs to be a common system of labeling the objects or nodes, but not the clusters. Perhaps I misunderstood the label discussion above. So, a clustering of nodes 1-4 represented as { 1 => A, 2 => A, 3 => B, 4 => B } and another clustering { 1 => X, 2 => Y, 3 => Y, 4 => Y } can be represented as A = {1, 2}, B = {3, 4}, X = {1}, Y = {2, 3, 4}, and the column names and row names of the confusion matrix would be {A, B} and {X, Y} respectively. The [A,X] entry of the matrix would be the number of elements common to A and X (one). – micans Sep 15 '11 at 15:50
@micans, do you know whether there is any implementation of the `VI` criterion in `R`? If so, in which package? – doctorate Nov 16 '13 at 14:05
@doctorate, the igraph package supports this. – micans Nov 18 '13 at 09:34
@micans, thanks. Could you please have a look at this question: http://stats.stackexchange.com/questions/77027/what-is-the-intuition-behind-the-variation-of-information-vi-metric-against-ot – doctorate Nov 19 '13 at 17:42
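
The four-node example from the comments above can be written out as a confusion matrix. The following is only an illustrative Python sketch; the variable names are hypothetical. It counts, for every pair of clusters, the nodes they have in common, using nothing but the shared node identifiers 1-4.

```python
from collections import Counter

# The two clusterings from the comment, keyed by the shared node labels.
first = {1: "A", 2: "A", 3: "B", 4: "B"}
second = {1: "X", 2: "Y", 3: "Y", 4: "Y"}

# Each node contributes one count to the (cluster in first, cluster in second) cell.
counts = Counter((first[node], second[node]) for node in first)

rows = sorted(set(first.values()))   # ['A', 'B']
cols = sorted(set(second.values()))  # ['X', 'Y']
table = [[counts[(r, c)] for c in cols] for r in rows]

print("   " + "  ".join(cols))
for r, row in zip(rows, table):
    print(r + "  " + "  ".join(str(v) for v in row))
# Prints:
#    X  Y
# A  1  1
# B  0  2
```

The [A,X] entry is 1, as stated in the comment. Renaming X and Y would only rename the columns; the counts themselves would not change.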

score 2 · Answer 2 · answered Dec 14 '11 at 20:39

One classic approach is the adjusted Rand index, a chance-corrected measure of similarity between two partitions (a clustering is, after all, a partition). It is already implemented in R, in the mclust package (function `adjustedRandIndex`). The adjusted Rand index lies between -1 and 1: it equals 1 when the two partitions are identical and is close to 0 when they agree no more than expected by chance. The index is not a metric (e.g., it doesn't satisfy the triangle inequality). It has the nice property of being able to compare partitions of different sizes (i.e., clusterings containing different numbers of clusters).
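
To make the definition concrete, here is a minimal Python sketch of the adjusted Rand index computed from pair counts. It is an illustration, not the mclust implementation: the function name is hypothetical, and returning 1.0 when both partitions are trivial (all objects in one cluster, or every object alone) is an assumed convention.

```python
from collections import Counter
from math import comb  # requires Python 3.8+


def adjusted_rand_index(a, b):
    """Chance-corrected agreement between two partitions of the same objects."""
    if len(a) != len(b):
        raise ValueError("both labelings must cover the same objects")
    n = len(a)
    if n < 2:
        raise ValueError("at least two objects are needed")
    # Number of object pairs placed together in both, in a, and in b.
    sum_joint = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    maximum = (sum_a + sum_b) / 2          # largest attainable agreement
    if maximum == expected:
        return 1.0  # assumed convention: both partitions are trivial and equal
    return (sum_joint - expected) / (maximum - expected)


# Hypothetical data: identical grouping with permuted cluster names.
print(adjusted_rand_index([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]))  # 1.0
# Two clusters versus two clusters of different sizes.
print(adjusted_rand_index(["A", "A", "B", "B"], ["X", "Y", "Y", "Y"]))  # 0.0
```

The first call returns 1 even though no cluster number matches its label number, so the index does not depend on how the clusters are named.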
