I have a distance matrix based on a normalized compression distance between files:
$$ d(x, y) = \frac{ C(xy) - \min \{ C(x), C(y) \} } {\max \{ C(x), C(y) \}} $$
Here, $xy$ denotes the concatenation of files $x$ and $y$, and $C(\cdot)$ is the length of a file after it has been compressed by compressor $C$.
The idea is this: if a compressor compresses the concatenation of two files better than it compresses them separately, it must have found some regularities that appear in both files.
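For concreteness, this is roughly how I compute the distance. It is only a minimal sketch, assuming zlib as the compressor (any real compressor could be substituted):

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length of data after compression with zlib at maximum compression level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```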
This metric can be used to cluster data such as genome sequences, MIDI music files, natural language texts, and so on. See this paper for plenty of examples: http://homepages.cwi.nl/~paulv/papers/cluster.pdf
My question is: how do I cluster (and visualise) this data, and how do I validate the clustering?
My intuition says average-linkage hierarchical clustering is a safe bet. But if I then validate it with, say, a connectedness measure, I am biasing the validation towards my choice of clustering method.
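To make the clustering step concrete, this is the kind of thing I have in mind: a minimal sketch assuming SciPy's average linkage on the precomputed NCD matrix (the file name and labels are just placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# D: full n x n NCD matrix with (near-)zero diagonal, loaded from somewhere.
D = np.loadtxt("ncd_matrix.txt")

# linkage expects a condensed distance vector; checks are disabled because
# NCD is not always perfectly symmetric in practice.
condensed = squareform(D, checks=False)
Z = linkage(condensed, method="average")

dendrogram(Z, labels=[f"file{i}" for i in range(len(D))])
plt.show()
```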
There is no gold standard to compare against, so I need an internal measure. Visualisations (like this one: http://www.complearn.org/images/34mammals-unrooted.png) can be validated somewhat by human inspection, but I need a rigorous method, something that can predict the correctness of a clustering of novel data. I am a newbie in this field; could someone point me to resources or outline a method?
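As an example of the kind of internal measure I mean, here is a sketch that cuts the tree from the snippet above into flat clusters and scores them with a silhouette score computed directly from the NCD matrix (assuming scikit-learn; `Z` and `D` are from the previous snippet):

```python
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import silhouette_score

# Cut the average-linkage tree into k flat clusters and score each cut
# with the mean silhouette, using the NCD matrix as precomputed distances.
for k in range(2, 8):
    labels = fcluster(Z, t=k, criterion="maxclust")
    score = silhouette_score(D, labels, metric="precomputed")
    print(k, score)
```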
Note that the authors of the paper and the visualisation method I linked to use a very expensive tree-optimisation algorithm, which doesn't necessarily make sense here. I want to see whether I can get comparable results with more common clustering algorithms.