Use the Matrix package to store the data as a sparse matrix, and the skmeans_xdist function from the skmeans package to calculate cosine distances.
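For example (a small sketch; as I recall, skmeans_xdist(x) returns the pairwise cosine distances, i.e. 1 - cosine similarity, between the rows of x):
library(skmeans)
#Tiny toy example, 3 items x 3 features; for something this small a
#plain dense matrix is fine
toy <- matrix(c(5, 3, 0,
                0, 4, 0,
                0, 0, 2), nrow = 3, byrow = TRUE)
skmeans_xdist(toy) #3x3 matrix of cosine distances between the rows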
/edit: It appears that the skmeans_xdist function is not very efficient. Here's a simple example of how you would calculate cosine similarity for a Netflix-sized matrix in R.
First, build the matrix:
library(Matrix)
set.seed(42)
non_zero <- 99000000 #about the number of ratings in the Netflix data (~100 million)
i <- sample(1:17770, non_zero, replace = TRUE) #movie indices
j <- sample(1:480189, non_zero, replace = TRUE) #user indices
x <- sample(1:5, non_zero, replace = TRUE) #ratings
m <- sparseMatrix(i = i, j = j, x = x) #rows are movies, columns are users
m <- drop0(m) #drop any explicitly-stored zeros
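A quick sanity check, if you want one (the density figure is approximate, since sparseMatrix sums duplicated (i, j) pairs):
dim(m) #17770 480189
nnzero(m) / prod(dim(m)) #~0.011, so about 1% of entries are non-zero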
Next, normalize each row so that its norm is 1. This takes 85 seconds on my machine:
row_norms <- sqrt(rowSums(m^2)) #Euclidean norm of each row
#Spread each row's norm across that row's non-zero entries, giving a
#sparse matrix with the same pattern as m
row_norms <- t(crossprod(sign(m), Diagonal(x = row_norms)))
row_norms@x <- 1/row_norms@x #invert the non-zero entries in place
m_norm <- m * row_norms #element-wise division by the row norms
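To verify the normalization, every row norm should now be 1 up to floating-point error (with 99 million non-zeros, no row ends up empty):
range(sqrt(rowSums(m_norm^2))) #both ends should be ~1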
Finally, we can compute the cosine similarity matrix, which takes 155 seconds on my machine:
system.time(sim <- tcrossprod(m_norm)) #sim[i, j] is the cosine similarity of movies i and j
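As a quick usage example (movie 1 is an arbitrary pick), the most similar movies to a given movie can be read straight off a row of sim:
order(sim[1, ], decreasing = TRUE)[1:6] #movie 1 itself comes first, with similarity 1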
Also, note that the cosine similarity matrix is pretty sparse, because many movies do not share any users in common. You can convert to cosine distance using 1 - sim, but be warned that subtracting a sparse matrix from a scalar turns all of the implicit zeros into ones, so the result is dense and might take a while to compute (I haven't timed it).
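If you need distances but want to stay sparse, one workaround is to flip only the stored entries. The caveat is that an absent entry then means distance 1 instead of 0, and downstream code has to account for that:
dist_sparse <- sim
dist_sparse@x <- 1 - dist_sparse@x #only rewrites the explicitly stored similarities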
/edit a couple years later: Here's a faster row-normalization function:
row_normalize <- function(m){
  row_norms <- sqrt(rowSums(m^2))
  row_norms <- t(crossprod(sign(m), Diagonal(x = row_norms)))
  row_norms@x <- 1/row_norms@x
  m_norm <- m * row_norms
  return(m_norm)
}
fast_row_normalize <- function(m){
  #Scaling with a Diagonal matrix avoids materializing the intermediate
  #sparse matrix of norms that row_normalize builds
  d <- Diagonal(x = 1/sqrt(rowSums(m^2)))
  return(t(crossprod(m, d)))
}
library(microbenchmark)
microbenchmark(
  a = row_normalize(m),
  b = fast_row_normalize(m),
  times = 1
)
The new function takes only 25 seconds (vs. 89 seconds for the other one; I guess my computer got slower =/):
Unit: seconds
expr min lq mean median uq max neval
a 89.68086 89.68086 89.68086 89.68086 89.68086 89.68086 1
b 24.09879 24.09879 24.09879 24.09879 24.09879 24.09879 1
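And to double-check that the two functions agree (the difference should be zero up to floating-point noise):
max(abs(row_normalize(m) - fast_row_normalize(m))) #~0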