
I have been struggling to create a distance matrix for a large dataset (800,000 rows × 20 columns). I have tried R (the `dist` function), MATLAB (the `pdist` function), and cloud computing (to increase RAM).

Ultimately, the limitation that I come up against is the maximum array size that each program can use.

The `dist` function in R returns the following error: `Error: cannot allocate vector of size 35298.4 Gb`

This is not resolved by increasing the RAM to the largest size available on cloud computing. Note: R is not preferred because many functions that I need to use (such as `hclust`) cannot process objects with more than 65536 rows.

The `pdist` function in MATLAB, running on an AWS cloud computer, returns the following error: `Requested 1x252043965036 (1877.9GB) array exceeds maximum array size preference.`
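For scale, here is a rough estimate of the storage involved. It assumes 8-byte doubles and the condensed format that `pdist` returns (one distance per pair of rows), with $n = 800{,}000$ as stated above:

$$
\frac{n(n-1)}{2} = \frac{800{,}000 \times 799{,}999}{2} = 319{,}999{,}600{,}000 \text{ distances}
$$

$$
319{,}999{,}600{,}000 \times 8 \text{ bytes} \approx 2.56 \times 10^{12} \text{ bytes} \approx 2384 \text{ GiB}
$$

The element count in the MATLAB message, 252,043,965,036, equals $n(n-1)/2$ for $n = 709{,}992$, so that particular run appears to have used somewhat fewer than 800,000 rows. Either way, the requirement grows quadratically with the number of rows.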

I suspect that the solution is to calculate distance matrices on subsets of the data and then fuse them together. However, I am not sure how to do this in a way that ensures that the distance between every pair of rows is preserved. A solution in MATLAB is preferred.
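As a minimal sketch of the "subsets" idea, the block below computes the distance matrix one tile at a time with `pdist2` and checks each tile against the full matrix. It uses a small stand-in dataset so the full matrix fits in memory for the comparison. It assumes the Statistics and Machine Learning Toolbox and Euclidean distance; `blockSize` is an arbitrary illustrative value.

```matlab
% Small stand-in for the real data, so the full matrix can be built for checking.
rng(0);
X = randi(5, 1000, 20);
n = size(X, 1);
blockSize = 250;                      % illustrative tile size (assumption)

% Reference: full n-by-n matrix, only feasible for small n.
Dfull = squareform(pdist(X));

maxErr = 0;
for i = 1:blockSize:n
    rows = i:min(i + blockSize - 1, n);
    for j = 1:blockSize:n
        cols = j:min(j + blockSize - 1, n);
        % Tile (rows, cols) of the distance matrix, computed on its own.
        Dblock = pdist2(X(rows, :), X(cols, :));
        diffBlock = Dblock - Dfull(rows, cols);
        maxErr = max(maxErr, max(abs(diffBlock(:))));
    end
end
fprintf('Largest tile-vs-full discrepancy: %g\n', maxErr);
```

Each tile holds exactly the distances between its row set and its column set, so no pairwise distance is lost by tiling. The tiles only save memory if each one is used and then discarded, though. Stitching them back together recreates the original storage problem, and `linkage` still expects the complete set of distances.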

My goal is to perform cluster analysis on the data. For instance:

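    X = randi(5, 800000, 20);   % synthetic data of the same size as mine
    Y = pdist(X);               % pairwise distances (this is the step that fails)
    Z = linkage(Y);             % hierarchical cluster tree
    dendrogram(Z)               % plot the tree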

unicoder
  • Roughly, you will need more than 5000 Gb of memory! Another thing to mention is that hierarchical cluster analysis, with the majority of its linkage methods, is only a locally optimal greedy algorithm (see the last point [here](https://stats.stackexchange.com/a/63549/3277)), which isn't suited for very many objects on _that_ ground, not because of memory limitations. I would recommend that you do clustering on much smaller random subsamples, or use another clustering method that doesn't need a distance matrix. Search the site for `clustering large data`. – ttnphns Feb 19 '18 at 11:04
  • You can [search for threads that are tagged both "large-data" *and* "clustering"](https://stats.stackexchange.com/questions/tagged/large-data+clustering). – Stephan Kolassa Feb 19 '18 at 11:15
  • What is the ultimate goal? That is, why do you think you need to compute a distance matrix? Perhaps there is a different way of achieving your bigger objective? – dt688 Feb 19 '18 at 12:43
  • Do you expect that your clusters are all equally represented in the data? – dt688 Feb 19 '18 at 15:10
  • @dt688 I do not expect equal representations of the clusters – unicoder Mar 16 '18 at 14:24
  • As an update: I have had some luck with kmeans and will explore other clustering algorithms. Thanks! – unicoder Mar 16 '18 at 14:25
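The two suggestions from the comments (clustering a random subsample, or using a method that needs no distance matrix) could be sketched in MATLAB roughly as follows. This is an illustration under stated assumptions, not a tested recipe for the real data: the subsample size `m`, the cluster count `k`, and the `'average'` linkage are hypothetical choices. Labelling the remaining rows by nearest centroid is a simplification swapped in here, not something the comments prescribe. It assumes the Statistics and Machine Learning Toolbox.

```matlab
rng(1);                                   % reproducible illustration
X = randi(5, 800000, 20);                 % synthetic data from the question
n = size(X, 1);
k = 4;                                    % hypothetical number of clusters

% Option 1: hierarchical clustering on a random subsample.
m = 5000;                                 % hypothetical subsample size
sub = randperm(n, m);                     % m distinct row indices
Z = linkage(pdist(X(sub, :)), 'average'); % pdist on m rows is about 100 MB
subLabels = cluster(Z, 'maxclust', k);    % at most k clusters

% Centroid of each subsample cluster.
nClust = max(subLabels);
C = zeros(nClust, size(X, 2));
for c = 1:nClust
    C(c, :) = mean(X(sub(subLabels == c), :), 1);
end

% Label every row by its nearest centroid (only an n-by-nClust matrix).
[~, labelsHier] = min(pdist2(X, C), [], 2);

% Option 2: k-means on the full data, no pairwise distance matrix required.
labelsKmeans = kmeans(X, k, 'Replicates', 3);
```

Both options keep memory roughly linear in the number of rows. With clusters of unequal size, as expected here, a uniform random subsample may contain few or no points from the rarest clusters, so the subsample size deserves some experimentation.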
