
The problem I face is somewhat awkward: I have 40,000 points in my dataset and would like to cluster them hierarchically, but due to the memory limitations of my laptop (and R), each clustering run can handle only 20,000 points.

I am wondering whether there is some way of combining the results of several clustering runs. Or, if there are other ways of solving this, such as online hierarchical clustering, please tell me some keywords I should google.

Thanks in advance.

  • Could you split it along the hierarchy? Do it layer by layer? – LauriK Nov 03 '14 at 07:34
  • A pragmatic solution would be to find a C/C++/... library to do the clustering for you, since those languages don't have memory constraints like R does. I'm sure some R clustering packages use C under the hood. – Marc Claesen Nov 03 '14 at 08:36
  • Please read the last paragraph in http://stats.stackexchange.com/a/63549/3277. You could do cluster analysis by subsamples. – ttnphns Nov 03 '14 at 11:23
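The subsample idea from the last comment can be sketched as follows: cluster a random subsample that fits in memory, then assign every remaining point to the nearest resulting cluster. This is a minimal illustration, not the method from the linked answer; `cluster_subsample`, `subsample_and_assign`, and the representative-per-cluster choice are all hypothetical names and simplifications introduced here.

```python
import random

def subsample_and_assign(points, m, cluster_subsample, dist, seed=0):
    """Cluster a random subsample of size m with any in-memory routine,
    then label all points by their nearest cluster representative.

    cluster_subsample: hypothetical callable returning one label per
    sampled point (e.g. cut from a hierarchical clustering of the sample).
    dist: pairwise distance function.
    """
    rng = random.Random(seed)
    sample = rng.sample(points, m)
    labels = cluster_subsample(sample)
    # One representative per cluster: the first sampled member seen.
    # (A medoid or centroid would be a better, but costlier, choice.)
    reps = {}
    for p, lab in zip(sample, labels):
        reps.setdefault(lab, p)
    # Assign every original point to the nearest representative.
    return [min(reps, key=lambda lab: dist(p, reps[lab])) for p in points]
```

With a threshold-based toy "clusterer", two well-separated groups on the line are recovered: `subsample_and_assign([0.0, 0.1, 10.0, 10.1], 4, lambda pts: [0 if p < 5 else 1 for p in pts], lambda a, b: abs(a - b))` returns `[0, 0, 1, 1]`.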

1 Answer


20,000 points is not a lot. Maybe the problem is the R implementation.

I recently ran OPTICS clustering (which is quite advanced: it uses geodetic distance and yields a cluster hierarchy) on 23 million tweet locations, on a single host: "Clustering 23 Mio Tweet locations"

23 million points is about half a GB of raw data, 240 MB compressed. Any modern system should be able to handle this amount of data easily.

Just don't compute a distance matrix.
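To see why the matrix, rather than the data, is the bottleneck, the arithmetic is straightforward: a dense matrix of 8-byte doubles for n points needs 8n² bytes (half that for the lower triangle).

```python
# Back-of-the-envelope memory cost of a dense distance matrix
# of 8-byte doubles for n points.
def dist_matrix_bytes(n, full=True):
    entries = n * n if full else n * (n - 1) // 2
    return entries * 8

print(dist_matrix_bytes(40_000) / 1e9)         # 12.8  (GB, full matrix)
print(dist_matrix_bytes(40_000, False) / 1e9)  # ~6.4 GB, lower triangle
```

So for the asker's 40,000 points, even the lower triangle (~6.4 GB) exceeds a typical laptop's RAM, while the raw coordinates themselves are tiny.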

Erich Schubert
  • This could sound really stupid, but I thought a distance matrix would be necessary for doing the clustering. Would you tell me some ways to avoid that? Some key words to google would be great. Thanks! – skyphantasy Nov 20 '14 at 03:12
  • Many algorithms need kNN or range queries. If you have a database-oriented architecture with accelerated kNN or range queries (as we have in ELKI), it can run algorithms significantly faster, as long as there is an index that can accelerate your distance query. There are plenty of choices for Euclidean distance, for example; but for some non-linear, non-metric distances it gets rather challenging. Some algorithms (e.g. SLINK for hierarchical clustering) simply compute the distance matrix one row at a time. Runtime then usually is O(n^2), and memory is O(n). – Erich Schubert Nov 24 '14 at 12:29
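The row-at-a-time approach mentioned in the last comment can be made concrete. Below is a minimal pure-Python sketch of SLINK (Sibson, 1973), which produces exact single-linkage hierarchical clustering in O(n²) time and O(n) memory: only one row of the distance matrix exists at any moment, so 40,000 points need a few hundred KB of working memory instead of gigabytes. This is an illustrative implementation, not production code.

```python
import math

def slink(points, dist):
    """SLINK single-linkage clustering in O(n^2) time, O(n) memory.

    Returns the pointer representation (pi, lam): point j merges into
    the cluster of point pi[j] at dendrogram height lam[j].
    """
    n = len(points)
    pi = [0] * n           # pointer representation of the dendrogram
    lam = [math.inf] * n   # merge heights
    M = [0.0] * n          # one row of distances, reused every iteration
    for i in range(1, n):
        pi[i] = i
        lam[i] = math.inf
        # Compute only row i of the distance matrix.
        for j in range(i):
            M[j] = dist(points[i], points[j])
        # Fold the new row into the pointer representation.
        for j in range(i):
            if lam[j] >= M[j]:
                M[pi[j]] = min(M[pi[j]], lam[j])
                lam[j] = M[j]
                pi[j] = i
            else:
                M[pi[j]] = min(M[pi[j]], M[j])
        # Repair pointers that now merge later than their target.
        for j in range(i):
            if lam[j] >= lam[pi[j]]:
                pi[j] = i
    return pi, lam
```

For points 0, 1, 3, 7 on the line with absolute-difference distance, the merge heights come out as 1, 2 and 4, matching single linkage computed from the full matrix.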