Use the Matrix package to store the data as a sparse matrix, and the skmeans_xdist function from the skmeans package to calculate cosine distances.
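For example (a small sketch; as I recall, skmeans_xdist(x) returns the pairwise cosine distances, i.e. 1 - cosine similarity, between the rows of x):
library(skmeans)
#Tiny toy example, 3 items x 3 features; for something this small a
#plain dense matrix is fine
toy <- matrix(c(5, 3, 0,
                0, 4, 0,
                0, 0, 2), nrow = 3, byrow = TRUE)
skmeans_xdist(toy) #3x3 matrix of cosine distances between the rows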
/edit: It appears that the skmeans_xdist function is not very efficient. Here's a simple example of how you would calculate cosine similarity for a Netflix-sized matrix in R.
First, build the matrix:
library(Matrix)
set.seed(42)
non_zero <- 99000000 #about the number of ratings in the Netflix data (~100 million)
i <- sample(1:17770, non_zero, replace = TRUE) #movie indices
j <- sample(1:480189, non_zero, replace = TRUE) #user indices
x <- sample(1:5, non_zero, replace = TRUE) #ratings
m <- sparseMatrix(i = i, j = j, x = x) #rows are movies, columns are users
m <- drop0(m) #drop any explicitly-stored zeros
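A quick sanity check, if you want one (the density figure is approximate, since sparseMatrix sums duplicated (i, j) pairs):
dim(m) #17770 480189
nnzero(m) / prod(dim(m)) #~0.011, so about 1% of entries are non-zero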
Next, normalize each row so that its norm is 1. This takes 85 seconds on my machine:
row_norms <- sqrt(rowSums(m^2)) #Euclidean norm of each row
#Spread each row's norm across that row's non-zero entries, giving a
#sparse matrix with the same pattern as m
row_norms <- t(crossprod(sign(m), Diagonal(x = row_norms)))
row_norms@x <- 1/row_norms@x #invert the non-zero entries in place
m_norm <- m * row_norms #element-wise division by the row norms
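To verify the normalization, every row norm should now be 1 up to floating-point error (with 99 million non-zeros, no row ends up empty):
range(sqrt(rowSums(m_norm^2))) #both ends should be ~1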
Finally, we can compute the cosine similarity matrix, which takes 155 seconds on my machine:
system.time(sim <- tcrossprod(m_norm)) #sim[i, j] is the cosine similarity of movies i and j
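As a quick usage example (movie 1 is an arbitrary pick), the most similar movies to a given movie can be read straight off a row of sim:
order(sim[1, ], decreasing = TRUE)[1:6] #movie 1 itself comes first, with similarity 1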
Also, note that the cosine similarity matrix is pretty sparse, because many movies do not share any users in common. You can convert to cosine distance using 1 - sim, but be warned that subtracting a sparse matrix from a scalar turns all of the implicit zeros into ones, so the result is dense and might take a while to compute (I haven't timed it).
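If you need distances but want to stay sparse, one workaround is to flip only the stored entries. The caveat is that an absent entry then means distance 1 instead of 0, and downstream code has to account for that:
dist_sparse <- sim
dist_sparse@x <- 1 - dist_sparse@x #only rewrites the explicitly stored similarities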
/edit a couple years later: Here's a faster row-normalization function:
row_normalize <- function(m){
  row_norms <- sqrt(rowSums(m^2))
  row_norms <- t(crossprod(sign(m), Diagonal(x = row_norms)))
  row_norms@x <- 1/row_norms@x
  m_norm <- m * row_norms
  return(m_norm)
}
fast_row_normalize <- function(m){
  #Scaling with a Diagonal matrix avoids materializing the intermediate
  #sparse matrix of norms that row_normalize builds
  d <- Diagonal(x = 1/sqrt(rowSums(m^2)))
  return(t(crossprod(m, d)))
}
library(microbenchmark)
microbenchmark(
  a = row_normalize(m),
  b = fast_row_normalize(m),
  times = 1
)
The new function takes only 25 seconds (vs. 89 seconds for the other one; I guess my computer got slower =/):
Unit: seconds
expr min lq mean median uq max neval
a 89.68086 89.68086 89.68086 89.68086 89.68086 89.68086 1
b 24.09879 24.09879 24.09879 24.09879 24.09879 24.09879 1
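And to double-check that the two functions agree (the difference should be zero up to floating-point noise):
max(abs(row_normalize(m) - fast_row_normalize(m))) #~0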