I have been struggling to create a distance matrix for some big data (an 800,000 x 20 matrix). I have tried R (the dist function), MATLAB (the pdist function), and cloud computing (to increase RAM).
Ultimately, the limitation I run up against is the maximum array size each program can handle: for n = 800,000 observations, the condensed distance vector alone has n(n-1)/2 ≈ 3.2 x 10^11 entries, i.e. over 2.5 TB in double precision.
The dist function in R returns the following error: Error: cannot allocate vector of size 35298.4 Gb
This is not resolved by increasing the RAM to the largest size available on the cloud instance. Note: R is not preferred because many of the functions I need (such as hclust) cannot process objects with more than 65536 rows.
The pdist function in MATLAB, running on an AWS cloud instance, returns the following error: Requested 1x252043965036 (1877.9GB) array exceeds maximum array size preference.
I suspect that the solution is to calculate distance matrices on subsets of the data and then fuse them together; however, I am not sure how to do this in a way that preserves every pairwise distance. A solution in MATLAB is preferred.
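For concreteness, here is a rough sketch of the block-wise computation I have in mind, using pdist2 on pairs of row blocks and streaming each block to a binary file (the block size and file name are placeholders I made up). Even computed this way, the blocks total roughly 2.5 TB on disk, and I do not see how to fuse them back into the single condensed distance vector that the clustering step below expects:

X = randi(5, 800000, 20);              % same toy data as below
b = 5000;                              % block size (arbitrary placeholder)
n = size(X, 1);
starts = 1:b:n;                        % first row of each block
fid = fopen('dist_blocks.bin', 'w');   % placeholder output file
for ii = 1:numel(starts)
    I = starts(ii):min(starts(ii) + b - 1, n);
    for jj = ii:numel(starts)          % upper-triangular blocks only (symmetry)
        J = starts(jj):min(starts(jj) + b - 1, n);
        D = pdist2(X(I, :), X(J, :));  % b-by-b block of distances, ~200 MB at b = 5000
        fwrite(fid, D, 'double');      % stream the block to disk
    end
end
fclose(fid);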
My goal is to perform cluster analysis on the data. For instance:
X = randi(5, 800000, 20);   % 800,000 observations, 20 variables
Y = pdist(X);               % condensed vector of all pairwise distances
Z = linkage(Y);             % hierarchical cluster tree
dendrogram(Z)               % plot the tree