
What is the best way to compute singular value decomposition (SVD) of a very large positive matrix (65M x 3.4M) where data is extremely sparse?

Less than 0.1% of the matrix is non-zero. I need a way that:

  • will fit into memory (I know that online methods exist)
  • will be computed in a reasonable time: 3-4 days
  • will be accurate enough; however, accuracy is not my main concern, and I would like to be able to control how many resources I put into it.

It would be great to have a Haskell, Python, C#, etc. library which implements it. I am not using Matlab or R, but if necessary I can go with R.

Sonia
  • How much memory do you have? 0.1% of 65M × 3.4M is still 221e9 non-zero values. At 4 bytes per value that is roughly 880 GB even assuming no overhead, so the sparsity still doesn't solve the problem... Do you need to load the whole set into memory at once? (See the back-of-envelope sketch after these comments.) – Bitwise Oct 26 '12 at 18:02
  • I should have been more precise. No more than 250-500 MB with 32-bit integers. Probably much less, but the dimensionality is the problem as I understand it. I have a 16 GB machine. – Sonia Oct 26 '12 at 18:05
  • How about this? http://www.quora.com/What-s-the-best-parallelized-sparse-SVD-code-publicly-available – Bitwise Oct 26 '12 at 19:16
  • This webpage links to a Python library which implements "a fast, incremental, low-memory, large-matrix SVD algorithm": http://en.wikipedia.org/wiki/Latent_semantic_analysis – Bitwise Oct 26 '12 at 19:56
  • See also https://stats.stackexchange.com/questions/2806. – amoeba May 23 '17 at 19:16
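A back-of-envelope version of that memory estimate (plain R arithmetic; 4 bytes per stored value is an assumption, and sparse formats also store indices on top of the values):

```r
total <- 65e6 * 3.4e6   # 2.21e14 entries in the full matrix
nnz   <- total * 0.001  # 0.1% non-zero: ~2.21e11 stored values
nnz * 4 / 1e9           # ~884 GB for the values alone, at 4 bytes each
```

A sparse format needs at least one index per stored value on top of that, so if the data really fits in 250-500 MB, the number of non-zeros must be far below the 0.1% bound.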

2 Answers


If it fits into memory, construct a sparse matrix in R using the Matrix package, and try irlba for the SVD. You can specify how many singular vectors you want in the result, which is another way to limit the computation.
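A minimal sketch of that route (the triplet vectors `i`, `j`, `x` are assumed to be already loaded from disk; `nv`/`nu` bound the work):

```r
library(Matrix)
library(irlba)

# i, j, x: row indices, column indices, and values of the non-zero entries
A <- sparseMatrix(i = i, j = j, x = x, dims = c(65e6, 3.4e6))

# Ask for only the top 10 singular triplets; this caps both time and memory
fit <- irlba(A, nv = 10, nu = 10)
fit$d  # the 10 largest singular values
```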

That's a pretty big matrix, but I've had very good results with this method in the past. irlba is pretty state-of-the-art. It uses the implicitly restarted Lanczos bidiagonalization algorithm.

It can chew through the Netflix Prize dataset (480,189 rows by 17,770 columns, 100,480,507 non-zero entries) in milliseconds. Your dataset is roughly 26,000 times bigger by total entries (and ~2,200 times by non-zero count), so it would take significantly longer than that. It might be reasonable to expect that it could do the computation in a couple of days.

Zach
  • If the data matrix fits into memory, will irlba handle the decomposition in a memory-efficient way as well? – Sonia Oct 26 '12 at 17:56
  • @Sonia: irlba is very memory efficient: it computes an approximate solution, you can limit the number of singular vectors, and it was designed to work on sparse matrices. As far as I know, it's as fast as you're going to get for computing partial SVDs. – Zach Oct 26 '12 at 18:05
  • @Sonia: Good luck! – Zach Oct 26 '12 at 18:14
  • Gave it a try - out of memory... I will compute a triangle block form before running it. – Sonia Nov 03 '12 at 13:13
  • @Sonia do you have it stored as a sparse `Matrix`? Try limiting the number of singular values you compute... maybe just look at the top 10? – Zach Nov 03 '12 at 18:43
  • I wish I had gotten this far. I loaded the MatrixMarket format, that's ~7 GB of RAM, and tried to initialize a sparseMatrix with it. That is weird because as a text file it is only 5 GB. It failed, saying I don't have enough memory. I suppose the data structure metadata takes up the rest of the space. I am currently reading about algorithms that will transform the matrix to a block form. The computation should be much faster in this form as well. I enjoy the challenge. – Sonia Nov 04 '12 at 05:29
  • @Sonia: I'm not familiar with the MatrixMarket format. If you can read the file into R as 3 vectors (1 representing row number, 1 representing column number, and 1 representing the values), it is very easy to convert to the `Matrix` format, with very little overhead. Note that `irlba` does not work with the `sparseMatrix` format. You have to use the `Matrix` package. – Zach Nov 05 '12 at 11:21
  • @Sonia: Ok, it looks like the `readMM` function in the `Matrix` package will read your Matrix Market dataset in a sparse format. Try running `irlba` with very small numbers, e.g. nu = 2, nv = 2 (see the sketch below). You might need a machine with ~16GB of RAM to do this. – Zach Jun 12 '13 at 14:37
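A sketch of the route from that last comment, assuming the data sits in a Matrix Market file (the filename is a placeholder):

```r
library(Matrix)
library(irlba)

# readMM() reads a Matrix Market file into a sparse triplet matrix
A <- readMM("data.mtx")      # placeholder path
A <- as(A, "CsparseMatrix")  # column-compressed form for fast multiplies

# Start with very few singular vectors to keep the memory footprint small
fit <- irlba(A, nv = 2, nu = 2)
fit$d
```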

If you're willing to have a low-rank approximation (as you would with Lanczos-type algorithms and a limited number of singular vectors), an alternative is stochastic SVD. You get similar accuracy and computational effort to things like irlba, but a much, much simpler implementation -- which is relevant if none of the available implementations handle precisely your situation.

If you start with an $N\times M$ matrix and you want a rank-$k$ approximation, you need to be able to multiply your matrix by a dense $M\times (k+10)$ matrix (the extra columns are oversampling) and then do an ordinary SVD on the resulting $N\times (k+10)$ matrix. As with Lanczos-type algorithms, the computation only uses the matrix for multiplication (it's a "matrix-free" algorithm), so it will take advantage of sparsity and other structure. (My use case was in genetics, and the matrix was the product of a sparse matrix and a projection orthogonal to a low-rank subspace.)
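A minimal sketch of the idea in R, following the standard randomized-SVD scheme of Halko, Martinsson, and Tropp (which adds a QR and projection step between the multiply and the small SVD); the function name and oversampling default are illustrative:

```r
library(Matrix)

# Randomized SVD: rank-k approximation of an N x M (sparse) matrix A,
# with oversampling p. A is only ever used for multiplication.
rsvd <- function(A, k, p = 10) {
  Omega <- matrix(rnorm(ncol(A) * (k + p)), ncol = k + p)  # dense test matrix
  Y <- as.matrix(A %*% Omega)   # N x (k+p) sample of the range of A
  Q <- qr.Q(qr(Y))              # orthonormal basis for that range
  B <- as.matrix(t(Q) %*% A)    # (k+p) x M projection; small enough for svd()
  s <- svd(B)                   # ordinary dense SVD of the small matrix
  list(u = Q %*% s$u[, 1:k], d = s$d[1:k], v = s$v[, 1:k])
}
```

Usage would be something like `s <- rsvd(A, k = 10)`; a power iteration or two (re-multiplying by the matrix and its transpose) tightens the accuracy when the spectrum decays slowly.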

Thomas Lumley