Verifying the output of implementing internal clustering validity indexes

Question

I have implemented some internal clustering validity indexes in Java:

Simplified Silhouette.
Calinski-Harabasz (VRC).
Davies-Bouldin.
Dunn's Index.

How could I verify if my implementation is correct?

I have tested the indexes on Iris, Wine, Ionosphere, Heart, Sonar, Zoo and Glass benchmarks.

I used the K-Means algorithm with the number of clusters ranging from 2 to 8.

The problem is: in most cases I obtain the best scores for partitions with 2 clusters. In the Zoo and Glass datasets, where the real number of clusters is 7, only one of the indexes scores best at k=7.

If it's important to mention:

The K-Means results (Trickl-Cluster's implementation) are identical to Weka's output (tested on the iris dataset).
The calculation of the centroids (means of the clusters) is almost identical to Weka's output.
The API used for matrix calculations is Colt (computing norms, matrix operations, distances between cluster centroids...).

What's wrong?

I would compare with some other implementation, preferably part of a published package so it is more likely to be correct. — Bitwise, Jul 27 '13 at 03:11
I have implemented all these indices (I don't know what _simplified_ Silhouette is) for SPSS (see my page). You could compare them with your results, if you use SPSS. Also, there are one or more R packages devoted to internal clustering criteria. — ttnphns, Jul 27 '13 at 04:48

score 6 · Answer 1 · answered Jul 27 '13 at 09:50

Class labels aren't the same as clusters.

If you look at e.g. the iris data set, it's fairly obvious that the best solution will have just 2 clusters, not three. Plot the unlabeled data and interview some people on the number of groups they see in this data set. If you set k to three, you will often get results like this, where the wrong cluster is split.

The problem is that there is a mismatch between these measures (which measure some mathematical properties) and reality. In reality, classes may consist of multiple clusters, and classes may cluster themselves. Your data may just lack the information to clearly show the structure that someone manually annotated.

Plus, preprocessing is essential. Preprocess your data differently, and your clustering algorithm will produce substantially different results, and the scores you compute will usually be quite different as well (at least for any distance-based index!).
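To make the preprocessing point concrete, here is a minimal sketch of per-attribute z-score standardization in plain Java. The class name `Standardize` and the method `zScore` are hypothetical, and z-scoring is only one possible preprocessing choice, not necessarily the right one for your data. Running k-means and the indexes on both the raw and the standardized data shows how much the scores depend on this step.

```java
final class Standardize {
    /**
     * Returns a z-scored copy of data (rows = points, columns = attributes).
     * Uses the population standard deviation; constant columns map to 0.
     */
    static double[][] zScore(double[][] data) {
        int n = data.length;
        int d = data[0].length;
        double[][] out = new double[n][d];
        for (int j = 0; j < d; j++) {
            double mean = 0;
            for (double[] row : data) mean += row[j];
            mean /= n;

            double var = 0;
            for (double[] row : data) var += (row[j] - mean) * (row[j] - mean);
            double sd = Math.sqrt(var / n);

            for (int i = 0; i < n; i++) {
                out[i][j] = (sd == 0) ? 0 : (data[i][j] - mean) / sd;
            }
        }
        return out;
    }
}
```

After this step, an attribute measured in the hundreds no longer dominates the Euclidean distance over one measured in single digits.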

If you want to compare a clustering result with existing class labels, it's best to use an external evaluation measure instead of an internal one.
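As one example of an external measure, the Adjusted Rand Index can be computed from the contingency table of class labels against cluster labels. The sketch below is a plain-Java illustration, not taken from any particular library; the class name `AdjustedRand` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class AdjustedRand {
    /** Number of unordered pairs among n items. */
    static double comb2(long n) {
        return n * (n - 1) / 2.0;
    }

    /** Adjusted Rand Index between two labelings of the same points. */
    static double ari(int[] truth, int[] pred) {
        int n = truth.length;
        if (n < 2) return 1.0; // no pairs to compare

        Map<String, Integer> cells = new HashMap<>();
        Map<Integer, Integer> rows = new HashMap<>();
        Map<Integer, Integer> cols = new HashMap<>();
        for (int i = 0; i < n; i++) {
            cells.merge(truth[i] + ":" + pred[i], 1, Integer::sum);
            rows.merge(truth[i], 1, Integer::sum);
            cols.merge(pred[i], 1, Integer::sum);
        }

        double sumCells = 0, sumRows = 0, sumCols = 0;
        for (int c : cells.values()) sumCells += comb2(c);
        for (int c : rows.values()) sumRows += comb2(c);
        for (int c : cols.values()) sumCols += comb2(c);

        double expected = sumRows * sumCols / comb2(n);
        double max = 0.5 * (sumRows + sumCols);
        // Degenerate case: both labelings are one big cluster or all singletons.
        if (max == expected) return 1.0;
        return (sumCells - expected) / (max - expected);
    }
}
```

The index is 1 when the two labelings agree up to a renaming of the labels, and close to 0 (possibly negative) when they agree no better than chance.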

Have you tried computing your measures on the "true" clusters? I wouldn't be surprised if most of the time, the results produced by the clustering algorithms score better on each of these measures!
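A sketch of that check in plain Java: score an arbitrary labeling (class labels or k-means output) with the same index and compare the two numbers. It assumes the common centroid-based definition of the Simplified Silhouette, where a(i) is the distance to the point's own centroid, b(i) is the distance to the nearest other centroid, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). Your definition may differ. The class name `SimplifiedSilhouette` is hypothetical, labels are assumed to be 0..k-1 with k >= 2 and no empty cluster.

```java
final class SimplifiedSilhouette {
    /** Mean of each group of points; assumes every label 0..k-1 occurs. */
    static double[][] centroids(double[][] x, int[] labels, int k) {
        int d = x[0].length;
        double[][] c = new double[k][d];
        int[] count = new int[k];
        for (int i = 0; i < x.length; i++) {
            count[labels[i]]++;
            for (int j = 0; j < d; j++) c[labels[i]][j] += x[i][j];
        }
        for (int g = 0; g < k; g++) {
            for (int j = 0; j < d; j++) c[g][j] /= count[g];
        }
        return c;
    }

    /** Euclidean distance. */
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }

    /** Average simplified silhouette of a labeling; requires k >= 2. */
    static double score(double[][] x, int[] labels, int k) {
        double[][] c = centroids(x, labels, k);
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double a = dist(x[i], c[labels[i]]);
            double b = Double.POSITIVE_INFINITY;
            for (int g = 0; g < k; g++) {
                if (g != labels[i]) b = Math.min(b, dist(x[i], c[g]));
            }
            double m = Math.max(a, b);
            sum += (m == 0) ? 0 : (b - a) / m;
        }
        return sum / x.length;
    }

    public static void main(String[] args) {
        double[][] x = {{0}, {2}, {10}, {12}};
        System.out.println(score(x, new int[] {0, 0, 1, 1}, 2)); // natural grouping
        System.out.println(score(x, new int[] {0, 1, 0, 1}, 2)); // interleaved grouping
    }
}
```

The tiny data set in `main` is also small enough to verify by hand: for the natural grouping the centroids are 1 and 11, so the four values of s(i) are 10/11, 8/9, 8/9 and 10/11, with mean 89/99. A few hand-computed cases like this are a cheap way to test each index implementation.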

For clustering with Java I mostly use ELKI. It's really fast, and it has plenty of algorithms, not just the 3 standard algorithms from the 70s that everybody has. But I don't think it currently has internal evaluation (it has some 20 measures for external evaluation though). I'm sure they would appreciate it if someone contributed such internal evaluation indexes! Maybe if you contribute your code there, they will help you verify the implementation. Maybe they already have some of the measures implemented somewhere, and I just didn't find them.

@Anony-Mousse: is there a "standard" preprocessing procedure to follow before applying clustering algorithms? I mean, something beyond cleaning the data and standardizing it (if this makes sense). If yes, do you have any reference? Thanks! — Avitus, Aug 26 '13 at 09:39
For some methods such as k-means, whitening seems to be a good idea (as it standardizes variance). For any distance based algorithm, preprocess and evaluate your distance metric. If the distance is crap, the clustering will be crap. If the distance is very good, the clustering will be good, too. **It is essential to look at your data and visualize a lot**. — Has QUIT--Anony-Mousse, Aug 26 '13 at 10:50
