I'm trying to compile a list of clustering algorithms that:
- Are implemented in R
- Operate directly on sparse data matrices (not (dis)similarity matrices), such as those created by the sparseMatrix function from the Matrix package.
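To make the distinction concrete, here is a minimal sketch on toy data (the matrix `m` and its values are made up for illustration). It shows the sparse classes mentioned in this post and contrasts a sparse data matrix with the dissimilarity object that `dist` would produce from it.

```r
library(Matrix)

# Hypothetical toy data: 3 rows (observations) x 4 columns (features)
m <- sparseMatrix(i = c(1, 1, 2, 3),
                  j = c(1, 3, 2, 4),
                  x = c(2, 1, 5, 3),
                  dims = c(3, 4))

class(m)                        # compressed sparse column storage
m_t <- as(m, "TsparseMatrix")   # triplet (i, j, x) storage
m_n <- as(m, "nMatrix")         # pattern only: positions, no values

n_stored <- length(m@x)         # only the non-zero entries are stored

# A dissimilarity object holds one value per pair of rows instead,
# and building it here requires a dense copy of the data.
d <- dist(as.matrix(m))
n_pairs <- length(d)
```

The data matrix grows with the number of stored entries, while the dissimilarity object grows with the square of the number of rows, which is why algorithms that accept the former directly are of interest.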
There are several other questions on CV that discuss this concept, but none of them link to R packages that can operate directly on sparse matrices:
- Clustering large and sparse datasets
- Clustering high-dimensional sparse binary data
- Looking for sparse and high-dimensional clustering implementation
- Space-efficient clustering
So far, I've found exactly one function in R that can cluster sparse matrices:
skmeans: spherical kmeans
From the skmeans package. K-means using cosine distance. Operates on dgTMatrix objects. Provides an interface to a genetic k-means algorithm, pclust, CLUTO, gmeans, and kmndirs.
Example:
library(Matrix)
set.seed(42)
nrow <- 1000
ncol <- 10000
i <- rep(1:nrow, sample(5:100, nrow, replace=TRUE))
nnz <- length(i)
M1 <- sparseMatrix(i = i,
                   j = sample(ncol, nnz, replace = TRUE),
                   x = sample(0:1, nnz, replace = TRUE),
                   dims = c(nrow, ncol))
M1 <- M1[rowSums(M1) != 0, colSums(M1) != 0]
library(skmeans)
library(cluster)
clust_sk <- skmeans(M1, 10, method='pclust', control=list(verbose=TRUE))
summary(silhouette(clust_sk))
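For reference, the cosine distance that spherical k-means is described as using can be computed on a sparse matrix without densifying the data. The following is only an illustrative sketch on made-up toy data, not the skmeans implementation: rows are scaled to unit length, after which plain dot products are cosine similarities.

```r
library(Matrix)

# Hypothetical toy data: rows 1 and 2 point in the same direction,
# row 3 shares no non-zero column with them
m <- sparseMatrix(i = c(1, 1, 2, 2, 3),
                  j = c(1, 2, 1, 2, 3),
                  x = c(1, 2, 2, 4, 5),
                  dims = c(3, 3))

row_norms <- sqrt(rowSums(m^2))               # Euclidean length of each row
m_unit <- Diagonal(x = 1 / row_norms) %*% m   # scale each row to unit length

cos_sim <- tcrossprod(m_unit)                 # dot products of unit rows
cos_dist <- 1 - as.matrix(cos_sim)            # cosine distance between rows
```

Rows 1 and 2 end up at distance 0 even though their magnitudes differ, and row 3 is at distance 1 from both. This assumes no all-zero rows, which the example above guarantees by filtering M1 on rowSums.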
The following algorithms get honorable mentions: they're not quite clustering algorithms, but they do operate on sparse matrices.
apriori: association rules mining
From the arules package. Operates on "transactions" objects, which can be coerced from ngCMatrix objects. Can be used to make recommendations.
Example:
library(arules)
M1_trans <- as(as(t(M1), 'ngCMatrix'), 'transactions')
rules <- apriori(M1_trans,
                 parameter = list(supp = 0.01, conf = 0.01, target = "rules"))
summary(rules)
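To show what the supp and conf thresholds measure, here is a small sketch that computes support and confidence for item pairs with a sparse cross-product. It is an illustration on made-up data, not how arules works internally, and it assumes rows are transactions and columns are items.

```r
library(Matrix)

# Hypothetical toy data: 4 transactions (rows) x 3 items (columns)
m <- sparseMatrix(i = c(1, 1, 2, 2, 3, 4),
                  j = c(1, 2, 1, 2, 1, 3),
                  x = rep(1, 6),
                  dims = c(4, 3))

n <- nrow(m)
co <- as.matrix(crossprod(m))   # co[a, b]: transactions containing both a and b
support <- co / n               # fraction of transactions containing the pair
confidence <- co / diag(co)     # row a, column b: confidence of the rule a => b
```

Items 1 and 2 occur together in 2 of 4 transactions, so their support is 0.5. The rule 2 => 1 has confidence 1 because every transaction with item 2 also has item 1, while 1 => 2 has confidence 2/3.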
irlba: sparse SVD
From the irlba package. Does SVD on sparse matrices. Can be used to reduce the dimensionality of sparse matrices prior to clustering with traditional R packages.
Example:
library(irlba)
s <- irlba(M1, nu = 0, nv=10)
M1_reduced <- as.matrix(M1 %*% s$v)
clust_kmeans <- kmeans(M1_reduced, 10)  # cluster the reduced matrix, not the original M1
summary(silhouette(clust_kmeans$cluster, dist(M1_reduced)))
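The projection step can be checked on a small case. In this sketch base R's dense svd stands in for irlba so that it needs only base R and Matrix (an assumption made for illustration, and only sensible for tiny data). It confirms that multiplying the matrix by the top-k right singular vectors gives the same coordinates as the left singular vectors scaled by their singular values.

```r
library(Matrix)

set.seed(1)
# Hypothetical toy data: 6 x 4 sparse matrix
m <- sparseMatrix(i = c(1, 2, 2, 3, 4, 5, 6, 6),
                  j = c(1, 1, 2, 3, 4, 2, 3, 4),
                  x = c(1, 2, 1, 3, 1, 2, 1, 2),
                  dims = c(6, 4))

k <- 2
s <- svd(as.matrix(m), nu = k, nv = k)   # dense SVD, toy data only

scores <- unname(as.matrix(m %*% s$v))   # project rows onto top-k right singular vectors
alt <- s$u %*% diag(s$d[1:k])            # equivalent form: U_k D_k

# The reduced matrix is dense with k columns, so ordinary kmeans applies
clust <- kmeans(scores, centers = 2)
```

The reduced matrix has one row per original row and k columns, which is what makes it usable with clustering functions that expect dense input.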
apcluster: Affinity Propagation Clustering
library(apcluster)
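sim <- tcrossprod(M1)                     # row-by-row dot products
sim <- sim / tcrossprod(sqrt(diag(sim)))  # rescale to cosine similarity between rows
clust_ap <- apcluster(as.matrix(sim))     # takes a while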
What other functions are out there?