I have a distance matrix based on a normalized compression distance between files:
$$ d(x, y) = \frac{ C(xy) - \min \{ C(x), C(y) \} } {\max \{ C(x), C(y) \}} $$
Here, $xy$ denotes the concatenation of files $x$ and $y$, and $C(\cdot)$ is the length of a file after it has been compressed by compressor $C$.
The idea is this: if a compressor compresses the concatenation of two files better than it compresses them separately, it must have found some regularities that appear in both files.
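For concreteness, this is roughly how I compute the distance. It is only a minimal sketch, assuming zlib as the compressor (any real compressor could be substituted):

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length of data after compression with zlib at maximum compression level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```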
This metric can be used to cluster data such as genome sequences, MIDI music files, natural language texts, and so on. See this paper for plenty of examples: http://homepages.cwi.nl/~paulv/papers/cluster.pdf
My question is: how do I cluster (and visualise) this data, and how do I validate the clustering?
My intuition says average-linkage hierarchical clustering is a safe bet. But if I then validate it with, say, a connectedness measure, I am biasing the validation towards my choice of clustering method.
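To make the clustering step concrete, this is the kind of thing I have in mind: a minimal sketch assuming SciPy's average linkage on the precomputed NCD matrix (the file name and labels are just placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# D: full n x n NCD matrix with (near-)zero diagonal, loaded from somewhere.
D = np.loadtxt("ncd_matrix.txt")

# linkage expects a condensed distance vector; checks are disabled because
# NCD is not always perfectly symmetric in practice.
condensed = squareform(D, checks=False)
Z = linkage(condensed, method="average")

dendrogram(Z, labels=[f"file{i}" for i in range(len(D))])
plt.show()
```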
There is no gold standard to compare against, so I need an internal measure. Visualisations (like this one: http://www.complearn.org/images/34mammals-unrooted.png) can be validated somewhat by human inspection, but I need a rigorous method, something that can predict the correctness of a clustering of novel data. I am a newbie in this field; could someone point me to resources or outline a method?
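As an example of the kind of internal measure I mean, here is a sketch that cuts the tree from the snippet above into flat clusters and scores them with a silhouette score computed directly from the NCD matrix (assuming scikit-learn; `Z` and `D` are from the previous snippet):

```python
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import silhouette_score

# Cut the average-linkage tree into k flat clusters and score each cut
# with the mean silhouette, using the NCD matrix as precomputed distances.
for k in range(2, 8):
    labels = fcluster(Z, t=k, criterion="maxclust")
    score = silhouette_score(D, labels, metric="precomputed")
    print(k, score)
```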
Note that the authors of the paper and the visualisation method I linked to use a very expensive tree-optimisation algorithm, which doesn't necessarily make sense here. I want to see whether I can get comparable results with more common clustering algorithms.