1

First, a theoretical question. I know that, naively, a hierarchical clustering algorithm has complexity cubic in the number of samples $N$, i.e. $O(N^3)$. This is because in each iteration one has to scan the entire distance matrix to find the smallest value.
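To make this concrete, here is a rough sketch of what I mean by the naive approach (my own illustration, assuming single linkage and a precomputed distance matrix; the function name is arbitrary):

```python
import numpy as np

def naive_single_linkage(D):
    """Naive agglomerative clustering on a precomputed distance matrix D.

    Every iteration rescans the full (shrinking) matrix for the smallest
    entry, which is what makes the straightforward version O(N^3) overall.
    Returns the merge history as (cluster_i, cluster_j, distance) tuples.
    """
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)
    active = list(range(len(D)))        # clusters that still exist
    merges = []
    while len(active) > 1:
        # O(N^2) scan for the closest pair of active clusters
        sub = D[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]
        merges.append((i, j, sub[a, b]))
        # single-linkage update: the merged cluster keeps the minimum distance
        D[i, :] = np.minimum(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        active.remove(j)
    return merges
```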

However, it is possible to implement it with lower complexity ($O(N^2)$). How is this done, and can you refer me to an article or a description of such an implementation?

Finally, I used the scikit-learn library to do hierarchical clustering. It was fine, but it is limited to about 15,000 samples, and I want to cluster much more. Can someone refer me to an implementation that handles larger sample sizes (and preferably runs in $O(N^2)$)?

gung - Reinstate Monica
idoda
  • Most "good" implementations start with the whole matrix and shrink it by one row and column at each step. This is efficient if the language is fast at slicing arrays. – ttnphns Dec 18 '13 at 12:40
  • But what is dubious is your intention to do hierarchical clustering of so many objects. Not because of speed/memory: if you are hoping to cluster thousands of objects into a _few_ clusters (as usual), you must be aware that the risk/amount of [suboptimality](http://stats.stackexchange.com/q/40933/3277) goes up with the steps. So the last steps in hierarchical clustering of big data are likely to be relatively much worse than the last steps with small data. This is why the classic hierarchical clustering algorithm is not recommended for big data anyway. – ttnphns Dec 18 '13 at 12:53
  • "Diminishing" will not take you to $O(N^2)$; I don't think that is the optimization being done. Regarding your second comment: I'm planning on calculating the threshold on a small dataset and applying it to large-scale data. My output will (hopefully, for my needs) be 99% singleton clusters and only a very small number of clusters containing more than one sample. – idoda Dec 18 '13 at 14:14
  • @ttnphns I doubt that a matrix-based approach can get down to $O(n^2)$. See: most matrix operations (such as searching the maximum) need $O(n^2)$, and if you do these at each iteration, you will be in $O(n^3)$. Methods such as SLINK that work in $O(n^2)$ do also need only linear memory! – Has QUIT--Anony-Mousse Dec 21 '13 at 21:49
  • @Anony-Mousse, my first comment said nothing about O(n^2); it was just a comment. I feel like agreeing with your answer below; I too, tend to think that it is possible only with single and complete linkages. – ttnphns Dec 21 '13 at 21:55
  • @ttnphns I believe I've read somewhere (on Wikipedia maybe) that an $O(n^2)$ approach also exists for UPGMA or one of the others. But at least for Ward I don't think this will work; and I know SLINK and CLINK exist; and SLINK in ELKI also performs very well in benchmarks. – Has QUIT--Anony-Mousse Dec 21 '13 at 21:57

3 Answers

3

As far as I know, this is not possible in general; such algorithms exist only for particular linkages, e.g. SLINK for single linkage and CLINK for complete linkage.

ELKI includes an $O(n^2)$ implementation, i.e. SLINK.

If I understood SLINK correctly, it essentially works "one row at a time", which is why it also needs less (linear!) memory.
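For illustration, here is a rough Python sketch of the SLINK recurrence as I understand it from Sibson's paper (my own paraphrase, not the ELKI code): each point is inserted in turn, only one row of distances is ever materialized, and the result is the "pointer representation" $\pi, \lambda$ rather than a merge list.

```python
import numpy as np

def slink(X, metric=lambda a, b: np.linalg.norm(a - b)):
    """Single-linkage clustering in O(n^2) time and O(n) extra memory.

    Returns the pointer representation: point i is merged into the
    cluster of point pi[i] at dendrogram height lam[i].
    """
    n = len(X)
    pi = np.zeros(n, dtype=int)
    lam = np.full(n, np.inf)
    M = np.empty(n)                      # scratch row of distances

    for t in range(1, n):                # insert point t into the structure
        pi[t], lam[t] = t, np.inf
        for i in range(t):               # distances to the new point only
            M[i] = metric(X[i], X[t])
        for i in range(t):
            if lam[i] >= M[i]:
                M[pi[i]] = min(M[pi[i]], lam[i])
                lam[i] = M[i]
                pi[i] = t
            else:
                M[pi[i]] = min(M[pi[i]], M[i])
        for i in range(t):
            if lam[i] >= lam[pi[i]]:
                pi[i] = t
    return pi, lam
```

If I read it correctly, sorting the points by their $\lambda$ values then yields the merge order of the dendrogram.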

Has QUIT--Anony-Mousse
0

CLINK and SLINK have a time complexity of $O(n^2)$, but I don't understand how, after merging two clusters, they find the distance from the merged cluster to the remaining clusters. If they compute it directly, then the worst-case time complexity should be $(n-1) + 2(n-2) + 3(n-3) + 4(n-4) + \dots + (n-1)\bigl(n-(n-1)\bigr) = O(n^3)$.
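(If the merged distances were maintained with the standard single/complete-linkage update

$$d(A \cup B,\, C) = \min\bigl(d(A, C),\, d(B, C)\bigr) \quad\text{resp.}\quad \max\bigl(d(A, C),\, d(B, C)\bigr),$$

each merge would cost only $O(n)$ updates, giving $O(n^2)$ overall; but I am not sure whether that is what SLINK and CLINK actually do.)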

0

This is simply to answer the last part of your question (that you cannot run hierarchical clustering on more than about 15,000 samples).

If you really want hierarchical clustering (for interpretation), I would suggest running k-means (which runs quite fast with scikit-learn) with a high k value (for example 1000 or 100). Then run hierarchical clustering on the resulting centroids to obtain a global clustering (and you can reassign each point to its centroid, and thus to its cluster).
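A rough sketch of this two-step approach with scikit-learn (the data, `k`, and the final number of clusters below are arbitrary placeholders; `MiniBatchKMeans` is just one convenient choice for the compression step):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans, AgglomerativeClustering

X = np.random.rand(100_000, 16)        # placeholder for your feature matrix

# Step 1: compress the data into k centroids with (mini-batch) k-means.
k = 1000
km = MiniBatchKMeans(n_clusters=k, random_state=0).fit(X)

# Step 2: hierarchical clustering on the k centroids only.
agg = AgglomerativeClustering(n_clusters=50)       # desired final clusters
centroid_labels = agg.fit_predict(km.cluster_centers_)

# Map every original point to the hierarchical cluster of its centroid.
point_labels = centroid_labels[km.labels_]
```

The hierarchical step then only ever sees `k` points, so its cost no longer depends on the original sample size.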

Scratch
  • Hmm... I wonder whether this would work. Still, I'm sure there are hierarchical clustering algorithms that are better implemented than the one scikit-learn is using. – idoda Dec 18 '13 at 14:42
  • Yes, this would work. It's called two-step clustering. – Scratch Dec 18 '13 at 14:48
  • Well, for me this is not possible. I can't do k-means on my data; all I have (and all I can generate) is an observation-to-observation distance, and I can't calculate the distance to an arbitrary reference point. – idoda Dec 22 '13 at 11:19