
We normally have fairly large datasets to model on, just to give you an idea:

  • over 1M features (sparse; on average, about 12% of the features are populated);
  • over 60M rows.

A lot of modeling algorithms and tools don't scale to such wide datasets.

So we're looking for a dimensionality reduction implementation that runs in a distributed fashion (i.e. on Spark/Hadoop, etc.). We aim to bring the number of features down to several thousand.

Since PCA relies on matrix multiplication, which doesn't distribute very well over a cluster of servers, we're looking at other algorithms or, perhaps, at other implementations of distributed dimensionality reduction.

Has anyone run into similar issues? How did you solve this?

There is a Cornell/Stanford paper on "Generalized Low-Rank Models" (http://web.stanford.edu/~boyd/papers/pdf/glrm.pdf) that speaks specifically to this:

  • page 8, "Parallelizing alternating minimization", explains how it can be distributed;
  • page 9, "Missing data and matrix completion", discusses how sparse/missing data can be handled.

GLRM seems to be what we are looking for, but we can't find good, actual implementations of those ideas.
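
One existing implementation is H2O's GLRM estimator, which is based on that paper (though, as noted in the comments below, we couldn't make clustered H2O work at our scale). Here is a rough sketch of driving it from Python; the data path, rank, and regularization settings are purely illustrative:

```python
# Rough sketch of H2O's GLRM from Python. The path, k, and regularization values
# are placeholders; parameter names follow the h2o-py docs and may vary by version.
import h2o
from h2o.estimators import H2OGeneralizedLowRankEstimator

h2o.init()
frame = h2o.import_file("hdfs:///data/wide_sparse_features.parquet")  # placeholder path

glrm = H2OGeneralizedLowRankEstimator(
    k=2000,                  # target number of latent features
    loss="quadratic",
    regularization_x="l2",
    regularization_y="l2",
    gamma_x=0.1,
    gamma_y=0.1,
    max_iterations=100,
)
glrm.train(training_frame=frame)

# The low-dimensional representation (the X factor, n_rows x k) is materialized
# as a separate H2OFrame whose id is reported in the model output.
x_factor = h2o.get_frame(glrm._model_json["output"]["representation_name"])
```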

Update 7/15/2018: Another paper is Fast Randomized SVD from Facebook (read it here: http://tygert.com/spark.pdf ), along with the idea of doing low-rank matrix approximation using ALS - http://tygert.com/als.pdf . There is no clear way to use them at scale yet, though - see the discussion at https://github.com/facebook/fbpca/issues/6
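
For reference, the fbpca package (the repository hosting the discussion linked above) provides a single-node randomized SVD. It doesn't solve the distributed problem, but a minimal sketch of its API on a toy sparse matrix shows what a scaled-out equivalent would need to expose; the shapes and rank below are illustrative:

```python
# Minimal single-node sketch of fbpca's randomized SVD on a sparse matrix.
# The matrix below is a toy stand-in for the real (60M x 1M, ~12% dense) data.
import scipy.sparse as sp
import fbpca

A = sp.random(10_000, 2_000, density=0.12, format="csr", random_state=0)

# Rank-k randomized SVD; raw=True skips mean-centering (important for sparsity),
# n_iter is the number of power iterations.
U, s, Vt = fbpca.pca(A, k=100, raw=True, n_iter=4)

# Reduced row representation: project rows onto the top-k right singular vectors.
A_reduced = A @ Vt.T    # shape: (n_rows, k)
```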

Any other ideas on how to tackle this? Are there other available GLRM implementations, or other distributed dimensionality reduction implementations?

Tagar
  • I assume you have tried SVD using MLlib or Spark ML ? – Robert Long Aug 28 '18 at 11:51
  • Yep. Thanks. Spark's SVD has a number of issues, discussed in detail here - http://tygert.com/spark.pdf – Tagar Aug 28 '18 at 15:40
  • Yes, I have read that paper - but what about the randomisation and explicit normalisation algorithms that they propose ? – Robert Long Aug 28 '18 at 15:44
  • You can read some more details on that here - https://github.com/facebook/fbpca/issues/6 – Tagar Aug 28 '18 at 15:46
  • Yep, saw that too, but only just noticed you are part of the discussion there :D ! So, did you try implementing their algorithm ? – Robert Long Aug 28 '18 at 16:21
  • Nope, we don't have the capacity to (re)write algorithms.. if you find a good implementation that satisfies the above requirements, you can post it here as an Answer and I'll accept it. Thanks – Tagar Aug 28 '18 at 19:00
  • I might be asking something trivial here, but have you tried standard recommender system implementations and then simply extracted the relevant item latent factors? – usεr11852 Jan 10 '19 at 23:43
  • @usεr11852 do you mean `frequent pattern mining`, like the `FP-growth` algorithm, for example? I did mention this to my colleagues but for some reason we dismissed that idea (there were some specifics that I don't remember now). If you meant frequent pattern mining, then I'm not sure I understand the latent factors part.. can you please post more details as an answer - I will happily upvote or accept it as the correct answer if that works for us. Thanks!! – Tagar Jan 15 '19 at 17:06
  • @Tagar: I was simply asking if you tried some standard approaches like FunkSVD, SVD++. Also, just to be clear: you are referring to 60M $\times$ 1M data at 12% sparsity - that's roughly 50-60TB? – usεr11852 Jan 15 '19 at 21:34
  • Yes. We couldn't make it work with clustered h2o, and Spark ML's hasn't worked for us for various reasons (some of them mentioned above). If scale-out solutions haven't worked for us, how would FunkSVD or SVD++ work? That's why this topic focuses on distributed implementations. How did you get the 50-60TB estimate? Most in-memory layouts have nice columnar compression (like Arrow), and some serializations also compress sparse datasets well (potentially storing just one bit per sparse cell). Our estimate for in-memory serialization is around 1-10TB. – Tagar Jan 15 '19 at 23:27
  • `60000000 * 1000000 * 8 / 1024^4 * 0.12` (in TBytes) (But maybe I am missing something else here!) – usεr11852 Jan 17 '19 at 22:01
  • Yep, you didn't account for the columnar encoding/compression that I already mentioned above. For example, with dictionary encoding, a column that has 100k distinct values takes `0.1` bytes per value on average (yes, 1 byte to store 10 values). (I looked at a real prod Parquet table that we have - Arrow serialization should be very similar.) Yes, many columns might take many bytes per value to store, but on average in-memory columnar compression can be super efficient. I think we're off-topic now :) – Tagar Jan 18 '19 at 06:15

1 Answer


A note about PCA

I'd like to clarify the language I'm going to use. Principal Component Analysis is an extremely wide umbrella term used for algorithms that try to capture "the gist" of the data (usually expressed as a matrix). There are two components in those algorithms: what exactly they are trying to capture, and the model they employ to capture that information (the general idea being that the compressed version of the data should occupy less space and maybe reveal some patterns in the data).

Given these definitions, SVD is the optimal way to capture the L2 norm of the matrix (meaning we are minimizing $\|A-\hat A\|$) under the low-rank linear model $\hat A = USV^T$ ($U$ and $V$ are matrices with orthonormal columns and $S$ is diagonal). SVD++ and FunkSVD are not exactly the same: they have a different loss function (they either do not take zeros into account or subsample them), they have explicit regularization, and they offer no guarantees about the orthogonality of the found singular vectors (which you might not care much about).

Best distributed SVD libs

If you think that capturing the L2 norm of your matrix is a good idea, then I can suggest one of two libraries that implement a Randomized SVD algorithm: the SparkRSVD library in Scala for Spark, or Dask ML's Truncated SVD in Python for, well, Dask. We wrote the first one and tested it against the second on matrices of up to 100 mln x 100 mln rows and columns with very low density (see our blog post here). I think both are good options; the choice depends mainly on the language/framework you're most comfortable with. Also, we found Dask to be faster on relatively small matrices, but the Spark library can scale to larger ones (from what we tested).
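
For a sense of what the Dask route looks like, here is a minimal sketch; the array sizes, chunking, and number of components are toy assumptions, and the real data would be loaded as a Dask array rather than generated:

```python
# Minimal sketch of randomized truncated SVD with Dask ML. Sizes, chunking,
# and n_components are toy values; the real matrix would be loaded, not generated.
import dask.array as da
from dask_ml.decomposition import TruncatedSVD

# Tall array chunked along rows only (a single chunk across columns),
# which is what Dask ML's SVD expects.
X = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))

svd = TruncatedSVD(n_components=200, algorithm="randomized", n_iter=4)
X_reduced = svd.fit_transform(X)   # dask array of shape (n_rows, 200)
```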

Now, why Randomized SVD and not SVD from MLLib, for instance?

If you look under the hood, Spark MLLib implements SVD only for RowMatrix (with the limitation num_cols <= MAX_INTEGER, which is fine) and computes a Gramian matrix ($G=A^T A$) to produce a smaller matrix that is easier to work with (in your case it would become an almost dense 1 mln x 1 mln matrix). That matrix should fit in the memory of a single node, or at least num_rows * num_factors should. This might or might not be okay for you, depending on what kind of hardware you have and how many latent factors you need to reconstruct.
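
For concreteness, this is the MLlib entry point being discussed, sketched in PySpark with toy sparse rows (an active SparkSession `spark` is assumed; the real rows would come from your dataset):

```python
# Sketch of MLlib's SVD on a RowMatrix (PySpark). Assumes an active
# SparkSession `spark`; the vectors and k below are toy values.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

sc = spark.sparkContext

# One (sparse) mllib Vector per row of the data matrix A.
rows = sc.parallelize([
    Vectors.sparse(5, [0, 3], [1.0, 7.0]),
    Vectors.sparse(5, [1, 4], [2.0, 3.0]),
    Vectors.sparse(5, [2], [5.0]),
])
mat = RowMatrix(rows)

# Rank-k SVD; internally this leans on Gramian-style (A^T A) computations,
# which is what becomes painful for very wide matrices, as discussed above.
svd = mat.computeSVD(2, computeU=True)
U, s, V = svd.U, svd.s, svd.V   # U: RowMatrix, s: DenseVector, V: DenseMatrix
```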

In contrast, Randomized SVD algorithms are based on the ideas of randomized algorithms and efficient QR decomposition. They do not require random access to the matrix A, just the ability to multiply by it, and they do not require storing the result of those multiplications on the same node, just a $\text{num\_factors} \times \text{num\_factors}$ matrix.
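
To make the idea concrete, here is a minimal single-node NumPy sketch of the standard randomized SVD recipe (Halko et al.); it is not the SparkRSVD or Dask ML implementation, just an illustration that the algorithm only ever needs products with $A$ and $A^T$ plus a small QR and SVD:

```python
# Single-node NumPy illustration of the randomized SVD recipe. The distributed
# libraries above differ in the details; this only shows the core structure.
import numpy as np

def randomized_svd(A, k, n_oversample=10, n_power_iter=2, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    l = k + n_oversample

    # 1. Sketch the range of A by multiplying it with a random Gaussian matrix.
    Y = A @ rng.standard_normal((n, l))

    # 2. A few power iterations sharpen the sketch when singular values decay slowly.
    for _ in range(n_power_iter):
        Y = A @ (A.T @ Y)

    # 3. Orthonormal basis of the sketch via QR.
    Q, _ = np.linalg.qr(Y)

    # 4. Project A onto that basis; B is only l x n, so its SVD is cheap.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]

# Toy usage: reduce a 2000 x 500 matrix to rank 20.
A = np.random.default_rng(1).standard_normal((2000, 500))
U, s, Vt = randomized_svd(A, k=20)
```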

P.S. Updated the answer to be more precise about what Spark MLLib implements.

Ivan Lobov