9

I have a large 400000 $\times$ 400000 dataset (dissimilarity matrix) and I want to do multi-dimensional scaling on it. However, the generic cmdscale() function in R appears to accept at most a 46340 $\times$ 46340 matrix as input. Is there any way I can work around this? Also, does Python have the same kind of limitation (in the sklearn package)?

I should add that memory is not an issue here.

Percy
  • I would say this is quite a big matrix. Even a plain eigendecomposition will be problematic, and MDS is more than that. Have you considered first doing a cluster analysis to reduce the cardinality of the data? – ttnphns Mar 02 '17 at 06:00
  • It might be worth looking at stochastic variations of dimensionality reduction algorithms for such a large dataset. [tSNE](https://lvdmaaten.github.io/tsne/) might be useful. – rightskewed Mar 02 '17 at 09:16
  • Do you really _need_ to process all that data? Without any more details, I think this is a more immediate question you have to answer yourself before proceeding. – Firebug Dec 08 '17 at 17:07
  • You might find this thread useful: https://stats.stackexchange.com/questions/68678/out-of-sample-embedding-into-principal-coordinate-space. It lists some extendible MDS implementations, which would make your problem tractable through approximation: initiate an MDS space with a random sample of your dataset and project the rest. – Eli Korvigo Mar 27 '18 at 18:29

3 Answers

15

It would take more than a terabyte just to store a dense distance matrix of that size in double precision ($400000^2$ entries at 8 bytes each is roughly 1.28 TB). It's probably not feasible to attack such a large-scale problem head-on with a standard MDS algorithm: the runtime scales as $O(n^3)$, and you may not even be able to fit the distance matrix in memory on a single machine. Depending on your situation, you may be able to work around this. How best to do so depends on your specific data and problem, but here are a few examples:

1) PCA is equivalent to classical MDS using the Euclidean distance metric. So, if this is the type of MDS you want to perform, you can use PCA instead. An additional requirement is that you have access to the original data points as vectors, rather than just the distance matrix. There are many algorithms for efficiently running PCA on enormous datasets.
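For example, classical MDS with Euclidean distances can be obtained from a truncated PCA of the raw data, without ever forming the $400000 \times 400000$ distance matrix. Below is a minimal sketch in R, assuming the original points are available as rows of a matrix; it uses the irlba package for the truncated decomposition, but any randomized/truncated SVD routine would do:

library(irlba)   # truncated PCA/SVD for large matrices

# Toy stand-in for the real data: 400000 points in 10 dimensions
X <- matrix(rnorm(400000 * 10), ncol = 10)

# Top 3 principal components, computed without building a full covariance or
# distance matrix; with Euclidean distances these coordinates coincide with a
# classical MDS embedding.
pc <- prcomp_irlba(X, n = 3)
embedding <- pc$x   # 400000 x 3 coordinates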

2) Use an approximate form of MDS. This approach could work if you want to perform classical MDS. It doesn't require a Euclidean distance metric or a vector space representation of the data (access to the distances is all that's required). Landmark MDS is a good example. The first step is to choose a smaller set of 'landmark points' that represent the full data set. These can be obtained simply by sampling randomly from the data, or using a quick, greedy algorithm to optimize them a little more. Clustering could also be used in principle, but it's computationally expensive and not necessary. Classical MDS is performed on the landmark points to embed them in a vector space. The classical MDS results for the landmark points are then used to map the full dataset into the embedding space. Although not originally formulated as such, it turns out that Landmark MDS works by using the Nyström approximation, which is a way to approximate the eigenvalues/eigenvectors of a large matrix using a smaller submatrix.
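To make this concrete, here is a minimal sketch of Landmark MDS in R, following the de Silva and Tenenbaum formulation. It assumes you can obtain the distances from every point to the chosen landmarks; the function and variable names are illustrative only:

# dist_to_landmarks: N x m matrix of distances from all N points to the m
# landmarks, with the landmarks themselves occupying the first m rows.
landmark_mds <- function(dist_to_landmarks, k = 2) {
  m  <- ncol(dist_to_landmarks)
  D2 <- dist_to_landmarks[1:m, ]^2                 # squared landmark-landmark distances
  mu <- colMeans(D2)

  # Classical MDS on the landmarks: double-centre and eigendecompose
  B    <- -0.5 * (D2 - outer(rowMeans(D2), rep(1, m)) - outer(rep(1, m), mu) + mean(D2))
  eig  <- eigen(B, symmetric = TRUE)
  vals <- eig$values[1:k]                          # assumed positive for the top k
  vecs <- eig$vectors[, 1:k, drop = FALSE]

  # Nystrom projection: x_a = -0.5 * L# (delta_a - mu), where L# is the
  # pseudoinverse transpose of the landmark embedding
  Lsharp <- sweep(vecs, 2, sqrt(vals), "/")        # m x k
  X2 <- dist_to_landmarks^2
  t(-0.5 * t(Lsharp) %*% (t(X2) - mu))             # N x k embedding, rows in input order
}

# Small synthetic example with 100 randomly chosen landmarks
set.seed(1)
pts  <- matrix(rnorm(4000), ncol = 2)
land <- sample(nrow(pts), 100)
d    <- as.matrix(dist(pts))[c(land, setdiff(seq_len(nrow(pts)), land)), land]
emb  <- landmark_mds(d, k = 2)

For the real problem you would compute only the $400000 \times m$ block of distances to the landmarks, never the full matrix.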

3) If you want to perform nonclassical MDS, methods like Landmark MDS won't work, because nonclassical variants of MDS are solved iteratively using optimization algorithms rather than by solving an eigenvalue problem. The following approach might work instead (see the sketch after this list):

a) Select a representative subsample of the data.

b) Use your chosen flavor of MDS to embed these points into a vector space.

c) Using the subsampled points as exemplars, learn a mapping into the embedding space with a nonlinear regression method. Proper care is needed to learn a good mapping without overfitting.

d) Use the learned mapping to obtain an embedding for the entire dataset.

I recall seeing a couple of papers describe this kind of approach as a way to perform out-of-sample generalization for nonlinear dimensionality reduction algorithms in general, but it seems to me that it could also be used to efficiently approximate nonclassical MDS. More specialized solutions may exist; if so, I'd be glad to hear about them.
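As an illustration of steps a) to d), here is a minimal sketch in R. It assumes the smacof package for the nonmetric MDS step and randomForest for the regression step; dissim is an illustrative name for the dissimilarities (only the columns belonging to the subsample are ever needed as regression features), and both the subsample size and the choice of regressor would need tuning in practice:

library(smacof)        # nonmetric/ordinal MDS on the subsample
library(randomForest)  # one nonlinear regression per embedding dimension

n   <- nrow(dissim)               # dissim: dissimilarity matrix (or on-demand lookup)
sub <- sample(n, 2000)            # a) representative subsample

# b) nonclassical (here: ordinal) MDS on the subsample's dissimilarities
fit <- mds(dissim[sub, sub], ndim = 2, type = "ordinal")
Y   <- fit$conf                   # embedding of the subsample

# c) regress each embedding coordinate on the distances to the subsample
#    (cross-validate to guard against overfitting)
feats  <- dissim[, sub]           # N x 2000 feature matrix
models <- lapply(seq_len(ncol(Y)),
                 function(j) randomForest(x = feats[sub, ], y = Y[, j]))

# d) map every point into the embedding space with the learned regressions
embedding <- sapply(models, function(m) predict(m, feats))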

4) Parallelization could be used to solve the full problem if absolutely necessary. For example, for classical MDS you could distribute blocks of the distance matrix across machines, then use parallel/distributed matrix operations and a distributed eigensolver to perform MDS. Some similar scheme could probably be devised for nonclassical MDS. But this would be a hefty undertaking, and it's quite possible that approximate MDS would work well enough, so that's a better place to start.

References:

De Silva and Tenenbaum (2004). Sparse multidimensional scaling using landmark points.

Platt (2005). FastMap, MetricMap, and Landmark MDS are all Nyström Algorithms.

user20160
  • @Percy This [thread](https://stats.stackexchange.com/questions/68678/out-of-sample-embedding-into-principal-coordinate-space) references several landmark MDS implementations in R and Python – Eli Korvigo Mar 27 '18 at 18:32
2

If you want to perform MDS on such a large dataset, you will need a parallel implementation of MDS. You can find a parallel, MPI-based implementation at [1], but to run a 400000 $\times$ 400000 dataset you would need a large cluster or a supercomputer. I have used it on a 200000 $\times$ 200000 distance matrix (I ran it on a 24-node supercomputer; it could be done with fewer nodes, but would take more time). There is also a version that runs on Spark, which you can use [2], but the Spark implementation is much slower than the MPI-based one.

[1] https://github.com/DSC-SPIDAL/damds

[2] https://github.com/DSC-SPIDAL/damds.spark

pulasthi
0

I was experiencing severe performance issues with cmdscale(). After some digging, I found a Java library called MDSJ (https://www.inf.uni-konstanz.de/exalgo/software/mdsj/) that reads a matrix from a text file and writes a multidimensionally scaled version to another text file. This can all be driven from bash, which in turn can be called using R's system() function. I'll share my code here in case anybody finds it useful:

library(cluster)  # for daisy()

# Write the dissimilarity matrix to ./MDS/data.txt and run MDSJ on it via bash.
saveMDS <- function(daisy.mat) {
  write.table(daisy.mat, file = "./MDS/data.txt", row.names = FALSE, col.names = FALSE)
  system('bash -c "cd MDS; > data-MDS.txt; java -jar mdsj.jar data.txt data-MDS.txt"')
}

# Read the scaled coordinates written by MDSJ back into R.
loadMDS <- function() {
  as.matrix(read.table("./MDS/data-MDS.txt", header = FALSE, sep = " "))
}

data <- data.frame(x = runif(10000), y = runif(10000), z = runif(10000))
daisy.mat <- as.matrix(daisy(data[, 1:3], metric = "gower"))
saveMDS(daisy.mat)
mds.data <- loadMDS()

In this instance, I have a folder called MDS in my working directory where the data text files are stored and where mdsj.jar is located.

Hope this helps!

Gregory