Validating clustering results with labeled data

Question

I am working on a clustering algorithm and would like to validate its performance against a well-known and widely used dataset: the KDD-CUP 99 dataset (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). This dataset provides both unlabeled and labeled test data. My question is: how should I validate my clustering algorithm's performance?

Let's say the results of my algorithm are as follows:
x1 -> cluster A
x2 -> cluster A
x3 -> cluster B
x4 -> cluster A

And let's say the labels provided are as follows:
x1 -> cluster 1
x2 -> cluster 1
x3 -> cluster 1
x4 -> cluster 2

Given that the cluster labels are completely different, how should I compare these? In this case, the obvious assumption is that cluster A is probably the same as cluster 1, but the correspondence may not always be this clear. Is there a standard way to evaluate such situations?

score 3 · Accepted Answer · edited Apr 13 '17 at 12:44

Look into distances between clusterings. They all use what is called the confusion matrix between two clusterings. The Rand index and the adjusted Rand index are well known, but I generally recommend using either Variation of Information or the lesser-known split-join distance (see e.g. "Comparing clusterings: Rand Index vs Variation of Information" and "How to interpret these indices/metrics for comparing partitions intuitively out of these images?" for more discussion).
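As a rough illustration of the confusion matrix idea, here is a minimal Python sketch that builds the overlap counts from two label lists and derives the Variation of Information and the split-join distance from them. The function names are made up for this example, the labelings are the toy ones from the question, and the formulas follow the usual textbook definitions (VI as the sum of the two conditional entropies, in nats), so treat it as a sketch rather than a reference implementation.

```python
from collections import Counter
from math import log


def contingency(labels_a, labels_b):
    """Overlap counts: (cluster in a, cluster in b) -> number of shared objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    return Counter(zip(labels_a, labels_b))


def variation_of_information(labels_a, labels_b):
    """VI = H(A|B) + H(B|A) in nats; 0 means the partitions are identical."""
    n = len(labels_a)
    joint = contingency(labels_a, labels_b)
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)
    vi = 0.0
    for (a, b), n_ab in joint.items():
        # n_ab / size_a[a] equals p(a,b) / p(a); likewise for b.
        vi -= (n_ab / n) * (log(n_ab / size_a[a]) + log(n_ab / size_b[b]))
    return vi


def split_join_distance(labels_a, labels_b):
    """2n minus the best overlap of every cluster, summed over both clusterings."""
    n = len(labels_a)
    joint = contingency(labels_a, labels_b)
    best_a = Counter()
    best_b = Counter()
    for (a, b), n_ab in joint.items():
        best_a[a] = max(best_a[a], n_ab)
        best_b[b] = max(best_b[b], n_ab)
    return 2 * n - sum(best_a.values()) - sum(best_b.values())


# Toy example from the question: x1..x4.
found = ["A", "A", "B", "A"]
truth = [1, 1, 1, 2]

overlap = contingency(found, truth)
vi = variation_of_information(found, truth)
sj = split_join_distance(found, truth)
print(dict(overlap), vi, sj)
```

Note that the label names never have to be matched up: only the sizes of the overlaps between clusters matter, so no cluster centers or original feature data are needed.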

answered Nov 06 '14 at 10:09

micans

I agree, but for the KDD-CUP-99 set I have no cluster centers or anything. I only have the labels as described in my original question. Doesn't this make finding a distance between clusterings impossible? – danielvdende Nov 06 '14 at 10:12
For these distances you only need the labels. The confusion matrix is the matrix where the (i,j) entry contains the overlap between cluster i in clustering One and cluster j in clustering Two. At no point do you use the data from which the clustering was computed. It will probably help to read up a bit more on this concept. – micans Nov 06 '14 at 10:31
Yes, but the confusion matrix assumes that you know the True Positives etc. How can I know the true positives if the labels produced by the algorithms differ? (As indicated in the question.) So, for the first table here (http://en.wikipedia.org/wiki/Confusion_matrix), I have one result that shows Cat, Dog, Rabbit as labels, but another that shows Wood, Metal, Aluminium (random example). My question is: how do I couple these different labels with each other (which is needed for Rand/Jaccard/Fowlkes-Mallows indices AND confusion matrices)? – danielvdende Nov 06 '14 at 10:41
In your case you'd have 'overlap(cluster 1, cluster A) == 2' (namely the objects x1 and x2) and the confusion matrix is not symmetric. The wiki example is not so applicable here, and the application of confusion matrix in clustering is different. It is best to approach it from the literature of clustering distances (also called partition distances). It is hard to find good web references quickly: Search e.g. 'on the use of the adjusted rand index as a metric for evaluating supervised classification' - this article describes the confusion matrix (I don't really recommend the Rand index). – micans Nov 06 '14 at 11:09
Ok, thanks :). I was also looking at other research, where cluster purity was often used. For this, each cluster is assigned a label based on the dominant label among its points (each point has a label, so you basically count which label occurs most often in the cluster). Do you think this would be a good idea? – danielvdende Nov 06 '14 at 11:26
Clustering distances are really nice I think, so I would have a look at them besides anything else you want to do. In the first reference in my answer above I mention consistency: a clustering can be different from the gold standard, but still e.g. be a good superclustering or good subclustering. The cluster purity is related to quantities used by clustering distances by the way. These distances have the advantage of being *nice* (e.g. satisfy the triangle inequality). Whatever you do, beware of the Rand index; it is heavily influenced by cluster sizes. – micans Nov 06 '14 at 11:37
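The cluster purity discussed in these comments can be sketched in a few lines of Python. This is only an illustration under the usual definition (each cluster takes its majority class, and purity is the fraction of points that agree with their cluster's majority); the function name and return shape are assumptions made for this example.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, class_labels):
    """Return (purity, majority class per cluster) for two aligned label lists."""
    members = defaultdict(Counter)
    for cluster, cls in zip(cluster_labels, class_labels):
        members[cluster][cls] += 1
    # Each cluster is "named" after the class that occurs most often in it.
    majority = {c: counts.most_common(1)[0][0] for c, counts in members.items()}
    hits = sum(counts[majority[c]] for c, counts in members.items())
    return hits / len(cluster_labels), majority


# Toy example from the question.
found = ["A", "A", "B", "A"]
truth = [1, 1, 1, 2]
score, majority = purity(found, truth)
print(score, majority)

# Caveat: putting every point in its own cluster trivially maximizes purity.
singleton_score, _ = purity(["c1", "c2", "c3", "c4"], truth)
print(singleton_score)
```

The singleton case is one reason purity alone can be misleading: it only looks in one direction of the overlap matrix, whereas the clustering distances above penalize both splitting and merging.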

score 1 · Answer 2 · answered Nov 09 '14 at 15:31

Be really careful with this data set.

KDD Cup '99 dataset (Network Intrusion) considered harmful

1. This data set in no way resembles current network traffic. Assuming that results on it indicate usefulness for detecting network intrusions is foolish.
2. With well-crafted methods (such as some simple IP filters), most of the attacks present in this data set can trivially be detected. On the raw data, a simple "TTL is even" vs. "TTL is odd" filter is apparently able to classify 100% correctly.
3. The data set has a massive amount of duplicates. If you do naive cross-validation, your results are likely to reflect overfitting, because the same records appear in both the test and training sets.
4. This is a classification data set, not a clustering data set. Clusters and classes are not the same thing. With clustering you want to discover something new in your data, and by evaluating against classification labels you actually penalize the algorithm for discovering anything new...
5. Attributes are categorical, binary, or falsely numerical (IP); there are next to no continuous attributes in this data set. Most distances on this data set are entirely meaningless.
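To illustrate the point about duplicates, here is a minimal Python sketch that removes exact duplicate records before splitting, so that no record can land in both the training and the test part. The function name, the tuple representation of a record, and the sample rows are all hypothetical; dropping duplicates also changes the class distribution, so this is one possible mitigation, not necessarily the right one for every experiment.

```python
import random


def split_without_leakage(records, test_fraction=0.2, seed=0):
    """Deduplicate hashable records, shuffle, and split into (train, test)."""
    # dict.fromkeys keeps the first occurrence of each record, in order.
    unique = list(dict.fromkeys(records))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]


# Hypothetical records: (duration, protocol, label).
records = [
    (0, "tcp", "normal"),
    (0, "tcp", "normal"),   # exact duplicate
    (2, "udp", "normal"),
    (0, "icmp", "smurf"),
    (0, "icmp", "smurf"),   # exact duplicate
    (5, "tcp", "neptune"),
    (1, "tcp", "normal"),
]

train, test = split_without_leakage(records, test_fraction=0.4)
print(len(train), len(test))
```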

All in all, stay away from this data set.

Thanks for your comment! I will take these issues into account (particularly #3). The reason I wish to use this set is that the clustering algorithms proposed in various scientific journals ALL use this dataset (algorithms like CluStream, DenStream etc.). I am aware that non-numerical data is unsuitable for clustering; this is also discussed in the previously mentioned papers, so for a proper comparison I will use the same attributes. – danielvdende Nov 10 '14 at 12:04
