I'm trying to embed roughly 60 million phrases into a vector space and then calculate the cosine similarity between them. I've been using sklearn's CountVectorizer with a custom-built tokenizer function that produces unigrams and bigrams. It turns out that to get meaningful representations, I have to allow a tremendous number of columns, linear in the number of rows. This leads to incredibly sparse matrices and is killing performance. It wouldn't be so bad if there were only around 10,000 columns, which I think is pretty reasonable for word embeddings.
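Roughly what my current setup looks like, heavily simplified (the real tokenizer and data are more involved than this toy version):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for my custom tokenizer: emits unigrams plus adjacent bigrams.
def unigram_bigram_tokenizer(text):
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

phrases = ["cheap flights to paris", "paris city break deals"]  # toy data

vectorizer = CountVectorizer(tokenizer=unigram_bigram_tokenizer,
                             lowercase=False, token_pattern=None)
X = vectorizer.fit_transform(phrases)
print(X.shape)  # (n_phrases, vocabulary_size) -- the column count explodes at scale
```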
I'm thinking of trying Google's word2vec because I'm pretty sure it produces much lower-dimensional and denser embeddings. But before that, are there any other embeddings that might be worth a look first? The key requirement is being able to scale to around 60 million phrases (rows).
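For concreteness, I imagine the word2vec route would look something like this, using gensim's Word2Vec (gensim 4.x API) and averaging word vectors to get a phrase vector, which is just one pooling choice I've seen suggested:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy pre-tokenized corpus standing in for the 60M phrases.
tokenized_phrases = [
    ["cheap", "flights", "to", "paris"],
    ["paris", "city", "break", "deals"],
]

# vector_size fixes the embedding dimension regardless of corpus size.
model = Word2Vec(sentences=tokenized_phrases, vector_size=100, window=5,
                 min_count=1, workers=4)

def phrase_vector(tokens, model):
    """Average the word vectors of a phrase (one simple pooling choice)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(phrase_vector(["cheap", "flights"], model).shape)  # (100,)
```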
I'm pretty new to the field of word embeddings, so any advice would help.
I should also add that I'm already using singular value decomposition to improve performance.
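The SVD step is something like the sketch below, simplified with a random sparse matrix standing in for the real count matrix and sklearn's TruncatedSVD (which accepts sparse input directly) standing in for however the decomposition is actually run:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Random sparse matrix standing in for the real phrase-count matrix.
X = sparse_random(10_000, 50_000, density=0.0002, format="csr", random_state=0)

# TruncatedSVD takes sparse input and returns a dense low-rank representation.
svd = TruncatedSVD(n_components=300, random_state=0)
X_reduced = svd.fit_transform(X)  # shape (10_000, 300)

# Cosine similarity on a small block; a full 60M x 60M similarity matrix
# would not fit in memory, so comparisons have to be done in chunks.
sims = cosine_similarity(X_reduced[:100], X_reduced[:100])
print(sims.shape)  # (100, 100)
```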