
I'm trying to embed roughly 60 million phrases into a vector space, then calculate the cosine similarity between them. I've been using sklearn's CountVectorizer with a custom-built tokenizer function that produces unigrams and bigrams. It turns out that to get meaningful representations I have to allow a tremendous number of columns, linear in the number of rows. This leads to incredibly sparse matrices and is killing performance. It wouldn't be so bad if there were only around 10,000 columns, which I think is pretty reasonable for word embeddings.
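For concreteness, here is a minimal sketch of that setup, with a hypothetical stand-in for the custom tokenizer; the toy phrases are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the custom tokenizer: emit unigrams and bigrams.
def tokenize(phrase):
    tokens = phrase.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

vectorizer = CountVectorizer(tokenizer=tokenize)

phrases = ["red running shoes", "blue running shoes", "leather office chair"]
X = vectorizer.fit_transform(phrases)  # scipy.sparse CSR matrix

# With ~60M phrases the vocabulary (number of columns) grows roughly
# linearly with the number of rows, so X becomes enormous and extremely sparse.
print(X.shape, X.nnz)
```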

I'm thinking of trying Google's word2vec because I'm pretty sure it produces much lower-dimensional, denser embeddings. But before that, are there any other embeddings that might be worth a look first? The key requirement is being able to scale to around 60 million phrases (rows).
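For reference, here is a rough sketch of what the word2vec route might look like, assuming gensim's Word2Vec (4.x API); representing a phrase by the average of its word vectors is just one simple choice, and all parameters are illustrative:

```python
import numpy as np
from gensim.models import Word2Vec

# Each phrase is a list of tokens; in practice these would be streamed
# from the database rather than held in memory.
phrases = [["red", "running", "shoes"],
           ["blue", "running", "shoes"],
           ["leather", "office", "chair"]]

# vector_size fixes the embedding dimension (e.g. 100-300) independently
# of corpus size, unlike CountVectorizer, whose columns grow with the data.
model = Word2Vec(sentences=phrases, vector_size=100, window=2,
                 min_count=1, sg=1, workers=4)

def phrase_vector(tokens):
    # Simple choice: average the word vectors of the phrase's tokens.
    return np.mean([model.wv[t] for t in tokens if t in model.wv], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(phrase_vector(phrases[0]), phrase_vector(phrases[1])))
```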

I'm pretty new to the field of word embeddings so any advice would help.

I should also add that I'm already using singular value decomposition to improve performance.
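A minimal sketch of such a post-embedding SVD step, assuming sklearn's TruncatedSVD; the sparse matrix here is a random stand-in and the number of components is illustrative:

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Random sparse stand-in for the huge document-term matrix from CountVectorizer.
X = sp.random(1000, 5000, density=0.001, format="csr", random_state=0)

# TruncatedSVD accepts scipy.sparse input directly, avoiding a dense conversion.
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)           # dense (n_rows, 100)

# Cosine similarities in the reduced space; with ~60M rows this would
# have to be computed block-wise rather than as one giant matrix.
sims = cosine_similarity(X_reduced[:10], X_reduced)
print(sims.shape)                          # (10, 1000)
```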

Kevin Johnson
  • Are you using Spark? – eliasah Sep 22 '15 at 21:37
  • @eliasah Nope, using numpy with OpenBLAS. – Kevin Johnson Sep 22 '15 at 21:40
  • To improve performance on this kind of computation, given the number of entries, you might want to consider distributed computing systems like Spark or Flink. Word2Vec is a great algorithm, but it's a memory hog. – eliasah Sep 22 '15 at 22:04
  • Not too worried about memory, to be honest; it doesn't seem to be much of an issue with CountVectorizer, which produces a larger matrix than word2vec (I think). The real problem is having to take the dot product of two 60-million-by-60-million matrices; I'd rather avoid having to port to GPUs by reducing the dimension of the embedding. – Kevin Johnson Sep 22 '15 at 22:29
  • Have you tried to compute PCA? – eliasah Sep 22 '15 at 23:53
  • Yes, I do take the SVD of the matrix of embedded vectors, but that's after the embedding happens. Do you know of a PCA technique that happens pre-embedding? Also, I should point out that any SVD algorithm (there are a couple) involves a lot of matrix operations, which makes them infeasible on matrices of this size. – Kevin Johnson Sep 23 '15 at 00:07
  • That's one of the reasons I suggested Spark in the first place. I'm sorry, I'm on my phone; I don't have access to any reference concerning pre-embedding PCA techniques. – eliasah Sep 23 '15 at 00:09
  • What I'd really like to know is whether there are other embedding algorithms, possibly word2vec, that behave the same way as simply counting frequencies as in CountVectorizer BUT produce vectors of much lower dimension. Does that make sense? – Kevin Johnson Sep 23 '15 at 00:10
  • No worries, I appreciate the help! :) – Kevin Johnson Sep 23 '15 at 00:11
  • Also, setting up Spark for this specific task seems like overkill, especially since I'm reading this data from Postgres. – Kevin Johnson Sep 23 '15 at 00:11
  • I'm not sure it's overkill with that amount of data. – eliasah Sep 23 '15 at 00:13
  • I think pre-embedding PCA is just preprocessing of the text, since getting rid of superfluous tokens results in a lower dimension later! – Kevin Johnson Sep 23 '15 at 00:13
  • You could be right about that! – Kevin Johnson Sep 23 '15 at 00:14
  • Removing superfluous tokens shouldn't reduce the dimension by much, since you are working with text. With a 150,000-word dictionary, removing stop words, for example, would only save you a couple of dozen dimensions. That won't help. – eliasah Sep 23 '15 at 00:16
  • Otherwise, you might want to consider topic modeling with Latent Dirichlet Allocation to reduce the vector size per phrase (see the sketch after this thread). – eliasah Sep 23 '15 at 00:24
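A minimal sketch of the LDA route suggested in the last comment, assuming sklearn's LatentDirichletAllocation; the toy corpus and the number of topics are purely illustrative:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the 60M phrases.
phrases = ["red running shoes", "blue running shoes", "leather office chair"]
X = CountVectorizer().fit_transform(phrases)

# Each phrase is mapped to a fixed-length topic distribution (n_components),
# regardless of vocabulary size.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)   # shape: (n_phrases, n_components)
print(theta.shape)
```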

1 Answer


There's been some recent work on dynamically choosing the dimensionality of word2vec (skip-gram) embeddings using ideas from Boltzmann machines. Check out this paper:

"Infinite dimensional word embeddings" -Nalsnick, Ravi

The basic idea is to let your training set dictate the dimensionality of your word2vec model, penalized by a regularization term related to the embedding dimension.

The above paper does this for words, and I'd be curious to see how well this performs with phrases.

Alex R.