How can I assess how descriptive feature vectors are?

Question

I am assessing how good different features are for unsupervised classification of a set of objects. For each different feature I test, I have computed a feature vector that describes the object. I then want to get a metric out for how 'good' this vector is at separating the objects into their respective classes.

My current method for doing this is to use k-means to cluster the objects in feature space, and then use the Adjusted Rand Index to assess the quality of the clustering.

However, is there a better way to assess the 'goodness' of the feature vectors, perhaps using something like mutual information? One drawback of using k-means with the ARI is that it gives no indication of how tight the clusters are.

score 5 · Accepted Answer · answered May 24 '11 at 15:05

One generally considers that a "good partitioning" must satisfy one or more of the following criteria: (a) compactness (small within-cluster variation), (b) connectedness (neighbouring data points belong to the same cluster), and (c) spatial separation (which must be combined with other criteria such as compactness or balance of cluster sizes). These criteria are part of a large battery of internal measures of cluster validity, internal meaning that no additional knowledge about the data, such as a priori class labels, is used. They can be complemented with so-called combination measures, which assess intra-cluster homogeneity and inter-cluster separation together (for example, the Dunn or Davies–Bouldin index, silhouette width, or the SD-validity index), with estimates of predictive power (self-consistency and stability of a partitioning), and with measures of how well distance information is reproduced in the resulting partitions (e.g., cophenetic correlation and Hubert's Gamma statistic). A more complete review, with simulation results, is available in

Handl, J., Knowles, J., and Kell, D.B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15): 3201-3212.
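To make the compactness and separation criteria concrete, here is a minimal sketch of the average silhouette width in plain Python. It assumes Euclidean distance and points given as tuples of numbers; the function name `mean_silhouette` is my own and does not come from any library. In practice a library routine (for instance `silhouette_score` in scikit-learn) is probably the more sensible choice.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette width of a partition (Euclidean distance assumed)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster (compactness)
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster (separation)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Toy data (made up for illustration): two tight, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_partition = mean_silhouette(toy_points, [0, 0, 1, 1])
poor_partition = mean_silhouette(toy_points, [0, 1, 0, 1])
```

Values close to 1 indicate tight, well-separated clusters, while negative values suggest that objects sit closer to another cluster than to their own. This addresses the "tightness" that the ARI alone does not report.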

I guess you could rely on some of them to compare your different cluster solutions and choose the feature set that yields the best indices. You can even use the bootstrap to estimate the variability of those indices (e.g., cophenetic correlation, Dunn's index, silhouette width), as was done by Tom Nichols and colleagues in a neuroimaging study, "Finding Distinct Genetic Factors that Influence Cortical Thickness".
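As a rough illustration of the bootstrap idea, the sketch below resamples objects together with their cluster labels and recomputes Dunn's index on each resample. This is a simplification I am assuming for brevity: it keeps the partition fixed instead of re-clustering every resample, so it only captures sampling variability of the index. The names `dunn_index`, `bootstrap_index` and `percentile_interval` are hypothetical helpers, not library functions.

```python
import math
import random
from itertools import combinations


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    min_between, max_within = math.inf, 0.0
    for (p, lab_p), (q, lab_q) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)
        if lab_p == lab_q:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    if max_within == 0.0:
        return math.inf  # every cluster collapsed to a single location
    return min_between / max_within


def bootstrap_index(points, labels, index_fn, n_boot=200, seed=0):
    """Recompute index_fn on n_boot resamples drawn with replacement."""
    if len(set(labels)) < 2:
        raise ValueError("need at least two clusters")
    rng = random.Random(seed)
    n = len(points)
    values = []
    while len(values) < n_boot:
        picks = [rng.randrange(n) for _ in range(n)]
        boot_labels = [labels[i] for i in picks]
        if len(set(boot_labels)) < 2:
            continue  # index is undefined when only one cluster was drawn
        values.append(index_fn([points[i] for i in picks], boot_labels))
    return values


def percentile_interval(values, level=0.95):
    """Simple percentile interval from the bootstrap distribution."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int((1 - level) / 2 * last)], ordered[int((1 + level) / 2 * last)]


# Toy data (made up): two groups of five points, shifted by 10 along x.
group = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
toy_points = group + [(x + 10.0, y) for x, y in group]
toy_labels = [0] * 5 + [1] * 5

boot_values = bootstrap_index(toy_points, toy_labels, dunn_index)
lower, upper = percentile_interval(boot_values)
```

Comparing such intervals across feature sets gives some sense of whether one set of indices is really better than another or whether the difference is within resampling noise.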

If you are using R, I warmly recommend taking a look at the fpc package, by Christian Hennig, which provides almost all statistical indices described above (cluster.stats()) as well as a bootstrap procedure (clusterboot()).

As for the use of mutual information in clustering, I have no experience with it, but here is a paper that discusses its use in a genomic context (with a comparison to k-means):

Priness, I., Maimon, O., and Ben-Gal, I. (2007). Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics, 8: 111.
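For what it is worth, mutual information can also serve as an external agreement measure between a clustering and known class labels, in the same role as the ARI. The sketch below is an assumption on my part about how one might do that in plain Python; it is not the distance measure used in the paper above, and the function names are hypothetical. Normalizing by the geometric mean of the two entropies is one common choice among several.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same objects."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum(
        (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )


def normalized_mutual_information(labels_a, labels_b):
    """MI divided by the geometric mean of the two entropies."""
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # convention chosen here when one labeling has a single group
    return mutual_information(labels_a, labels_b) / math.sqrt(h_a * h_b)


# Toy labelings (made up): cluster ids versus reference classes.
cluster_ids = [0, 0, 1, 1]
matching_classes = ["x", "x", "y", "y"]
unrelated_classes = ["x", "y", "x", "y"]

nmi_matching = normalized_mutual_information(cluster_ids, matching_classes)
nmi_unrelated = normalized_mutual_information(cluster_ids, unrelated_classes)
```

Like the ARI, this only measures agreement with the reference labels and says nothing about how tight the clusters are, so it complements the internal indices rather than replacing them.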
