
I'm trying to compile a list of clustering algorithms that:

  1. Are implemented in R
  2. Operate on sparse data matrices (not (dis)similarity matrices), such as those created by the sparseMatrix function.

There are several other questions on CV that discuss this concept, but none of them link to R packages that can operate directly on sparse matrices:

  1. Clustering large and sparse datasets
  2. Clustering high-dimensional sparse binary data
  3. Looking for sparse and high-dimensional clustering implementation
  4. Space-efficient clustering

So far, I've found exactly one function in R that can cluster sparse matrices:

skmeans: spherical k-means

From the skmeans package. k-means using cosine distance. Operates on dgTMatrix objects. Also provides interfaces to a genetic k-means algorithm, pclust, CLUTO, gmeans, and kmndirs.

Example:

library(Matrix)
set.seed(42)

nrow <- 1000
ncol <- 10000
i <- rep(1:nrow, sample(5:100, nrow, replace=TRUE))
nnz <- length(i)
M1 <- sparseMatrix(i = i,
                   j = sample(ncol, nnz, replace = TRUE),
                   x = sample(0:1 , nnz, replace = TRUE), 
                   dims = c(nrow, ncol))
M1 <- M1[rowSums(M1) != 0, colSums(M1) != 0]

library(skmeans)
library(cluster)
clust_sk <- skmeans(M1, 10, method='pclust', control=list(verbose=TRUE))
summary(silhouette(clust_sk))

The following algorithms earn honorable mentions: they're not quite clustering algorithms, but they operate on sparse matrices.

apriori: association rule mining

From the arules package. Operates on "transactions" objects, which can be coerced from ngCMatrix objects. Can be used to make recommendations.

Example:

library(arules)
M1_trans <- as(as(t(M1), 'ngCMatrix'), 'transactions')
rules <- apriori(M1_trans,
                 parameter = list(supp = 0.01, conf = 0.01, target = "rules"))
summary(rules)

irlba: sparse SVD

From the irlba package. Computes a truncated SVD on sparse matrices. Can be used to reduce the dimensionality of sparse matrices prior to clustering with traditional R packages.

Example:

library(irlba)
s <- irlba(M1, nu = 0, nv=10)
M1_reduced <- as.matrix(M1 %*% s$v)
clust_kmeans <- kmeans(M1_reduced, 10)
summary(silhouette(clust_kmeans$cluster, dist(M1_reduced)))

apcluster: affinity propagation clustering

From the apcluster package. Operates on a similarity matrix, which can be computed directly from a sparse data matrix.

Example:

library(apcluster)
sim <- tcrossprod(M1)               # row-by-row inner products (n x n, stays small)
norms <- sqrt(diag(sim))
sim <- as.matrix(sim) / outer(norms, norms)  # cosine similarity between rows
clust_ap <- apcluster(sim)          # takes a while

What other functions are out there?

Zach
  • Do you mean sparse as in "lots of zeros" or as in "lots of missing values"? – cbeleites unhappy with SX Jan 07 '14 at 14:52
  • This question appears to be off-topic according to multiple criteria at http://stats.stackexchange.com/help/dont-ask: every answer would be equally valid, you expect more answers in addition to those provided, and there is no actual problem to be solved. – whuber Jan 07 '14 at 14:55
  • I realise this got closed, but I've been tripping over all your questions on this as I browse SO as I had a similar problem ;) I found this library which uses affinity propagation that can work with sparse matrices: http://www.bioinf.jku.at/software/apcluster/ – MarkeD Jul 09 '15 at 11:28
  • @MarkeD Thanks a lot! It's really too bad software recommendations are off-topic here, as I've found nowhere else online to ask for them. – Zach Jul 09 '15 at 12:56
  • @MarkeD Could you post an example with the M1 matrix in my question above? I can't seem to get apcluster to run on that matrix in a reasonable amount of time. – Zach Jul 09 '15 at 13:05
  • @Zach ahh sorry should have put more context in the comment ;) The algo only runs in reasonable time on small data sets but doesn't care about "lots of zero" matrices - I run it on a sample and it then outputs "examplers" which for your big matrix case you could then use as a start point for k-means. So not dimension reduction etc. but alt approach(?) – MarkeD Jul 09 '15 at 13:55
  • I'll just mention mcl (http://micans.org/mcl), as that algorithm is actually closely tied to sparse matrices; it is most naturally formulated in the language of matrix operations and utilises sparse matrices in the implementation. It is a custom C implementation (not directly available in R, although people have written R interfaces to the external mcl process), quite widely used in bioinformatics. – micans Jul 09 '15 at 16:19
  • 3
    once again very useful question is closed :( if you dont know the answer just dont vote for close! – MonsterMMORPG Apr 04 '16 at 23:03

1 Answer


I don't use R. It is often very slow and has next to no indexing support. But software recommendations are considered off-topic anyway.

Note that plenty of algorithms don't care how you store your data. If you prefer to have a sparse matrix, that should be your choice, not the algorithm's choice.

People who use R heavily tend to get stuck thinking in matrix operations (because that is the only way to write fast code in R). But that is a limited way of thinking. Take k-means, for example: it doesn't care how the data is stored. In particular, it never uses pairwise point-to-point distances at all. It just needs a way to compute each point's variance contribution, which is equivalent to computing its squared Euclidean distance to a center.
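To make that concrete, here is a minimal sketch (in R, since the question asks for R) of Lloyd's k-means running directly on a sparse Matrix object; all names are illustrative, not from any package. Distances to centers come from the expansion ||x − c||² = ||x||² − 2·x·c + ||c||², so the data is never densified and no point-to-point distance matrix is formed:

```r
library(Matrix)
set.seed(1)
X <- rsparsematrix(100, 50, density = 0.1, rand.x = runif)  # toy sparse data
k <- 3
# initialize centers from k random rows (small dense k x ncol matrix)
centers <- as.matrix(X[sample(nrow(X), k), , drop = FALSE])
for (iter in 1:10) {
  # n x k squared distances via ||x||^2 - 2*x.c + ||c||^2, using sparse products
  d2 <- rowSums(X^2) - 2 * as.matrix(X %*% t(centers)) +
    matrix(rowSums(centers^2), nrow(X), k, byrow = TRUE)
  cluster_id <- max.col(-d2)        # nearest center per row
  for (j in 1:k) {                  # recompute centers as cluster means
    members <- which(cluster_id == j)
    if (length(members) > 0)
      centers[j, ] <- colMeans(as.matrix(X[members, , drop = FALSE]))
  }
}
```

Only the k x ncol center matrix and the n x k distance matrix are ever dense; the data stays sparse throughout.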

Or DBSCAN. All it needs is a "neighbor" predicate. It can work with arbitrary graphs; Euclidean distance with an epsilon threshold is just the most common way of computing the neighborhood graph it uses.
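As a sketch of that idea, here is a toy DBSCAN written against an abstract neighbors() function (dbscan_generic and nb_fun are hypothetical names, not from any package); any structure that can answer "who is adjacent to point i?", such as a sparse adjacency matrix, can supply the predicate:

```r
# Minimal DBSCAN over an arbitrary neighbor predicate: no coordinates,
# no distance metric, just a function returning the indices adjacent to i.
dbscan_generic <- function(n, neighbors, minPts = 3) {
  labels <- integer(n)               # 0 = noise / unassigned
  cl <- 0L
  for (i in seq_len(n)) {
    if (labels[i] != 0L || length(neighbors(i)) < minPts) next
    cl <- cl + 1L                    # i is a core point: start a new cluster
    queue <- i
    while (length(queue) > 0) {
      j <- queue[1]; queue <- queue[-1]
      if (labels[j] == 0L) {
        labels[j] <- cl
        nb <- neighbors(j)
        if (length(nb) >= minPts) queue <- c(queue, nb)  # expand core points
      }
    }
  }
  labels
}

# Example: neighborhood graph stored as a sparse adjacency (pattern) matrix.
# Points 1-3 form one component, point 4 is isolated, points 5-6 another.
library(Matrix)
A <- sparseMatrix(i = c(1, 2, 2, 3, 5, 6),
                  j = c(2, 1, 3, 2, 6, 5),
                  dims = c(6, 6))
nb_fun <- function(i) c(i, which(A[i, ]))   # a point neighbors itself
labels <- dbscan_generic(6, nb_fun, minPts = 2)
```

With minPts = 2, points 1-3 and 5-6 form two clusters and the isolated point 4 stays labeled 0 (noise).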

P.S. Your question isn't very precise. Do you refer to sparse data matrices or sparse similarity matrices?

Has QUIT--Anony-Mousse
  • Sparse data matrices – Zach Jan 06 '14 at 20:35
  • Most algorithms can operate on sparse data matrices. E.g., AGNES, PAM, DBSCAN, OPTICS, CLARA, ... – Has QUIT--Anony-Mousse Oct 11 '17 at 21:29
  • Not sure why you even answered if you don't even know R. – user3932000 Jul 31 '19 at 16:08
  • I know R. Probably even better than the average R user. I know non-standard evaluation in R, and I know that most of the modules are written in C, so when you pass a sparse matrix, it is first copied into a dense matrix before being passed to the actual code. And every package does this in a different way... That is not efficient. You don't choose R if you need efficiency, good integration, backwards compatibility, or coordinated development. – Has QUIT--Anony-Mousse Jul 31 '19 at 17:40