
I have a dataset with a large number of features and about 15 000 observations. I'm clustering the observations using a probability-distribution distance metric related to the Jensen-Shannon divergence (JSD), calculated as described at http://enterotype.embl.de/enterotypes.html. I'm applying the R implementation of Partitioning Around Medoids (PAM) clustering to my JSD distance matrix.
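For concreteness, here is a minimal sketch of that pipeline; `probs` (a documents × terms matrix whose rows are per-document probability distributions), the `jsd()` helper and `k = 3` are my own placeholders, and, like the enterotype tutorial, it clusters on the square root of JSD, which is a proper metric:

    library(cluster)  # provides pam()

    # Jensen-Shannon divergence between two probability vectors,
    # using the 0 * log(0) = 0 convention for absent terms
    jsd <- function(p, q) {
      m <- (p + q) / 2
      kl <- function(a, b) sum(ifelse(a == 0, 0, a * log(a / b)))
      (kl(p, m) + kl(q, m)) / 2
    }

    n <- nrow(probs)
    d <- matrix(0, n, n)
    for (i in seq_len(n - 1)) {
      for (j in (i + 1):n) {
        d[i, j] <- d[j, i] <- sqrt(jsd(probs[i, ], probs[j, ]))
      }
    }

    fit <- pam(as.dist(d), k = 3, diss = TRUE)  # PAM on the precomputed distances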

The issue is that the distance matrix seems to be too big and eats all the memory: a dense 15 000 × 15 000 double-precision matrix alone takes about 1.8 GB, before any copies R makes during the computation. I'm looking for alternative implementations. k-means doesn't work with distance metrics other than Euclidean, R's clara works only with Euclidean and Manhattan distances, and Spark does not support PAM or CLARA yet.

UPDATE 5.2.2015: Thanks, Aleksandr, for the thorough collection of links! I would actually like to stay in the scope of my question. I'm interested in the similarities between my documents. The JSD distances express the document similarities very well and I'm happy with them, so I would like to continue from there. For each document, I need to figure out which other documents are the most similar ones that are relevant from a business point of view.

There are two overly naive approaches: order the other documents by JSD distance from each document and either take the first n, or take all those below some JSD threshold, as the similar ones.
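In code, those two selections would look something like this sketch, where `d` is the full JSD distance matrix and `n` and `threshold` are placeholders:

    neighbours <- setdiff(order(d[i, ]), i)              # documents nearest to document i first
    top_n      <- neighbours[seq_len(n)]                 # approach 1: the n nearest documents
    below_thr  <- setdiff(which(d[i, ] < threshold), i)  # approach 2: everything within the threshold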

The better approach is to let a clustering algorithm form the clusters, because both naive approaches either cut off too many genuinely similar documents or also pick up documents that are actually not similar.

So my question is: which clustering algorithm is the most reasonable to use, given that I already have a very good JSD distance matrix to feed in?

Andres Kull
  • Are you limited to a specific type (or a set of types) of clustering, or can you consider alternatives? – Aleksandr Blekh Feb 04 '15 at 17:56
  • All good alternatives are welcome! – Andres Kull Feb 04 '15 at 17:57
  • One more clarification: is your data mostly continuous, mostly categorical or mixed? – Aleksandr Blekh Feb 04 '15 at 18:05
  • Document-term data, where term frequency distributions are converted to probability distributions per document. – Andres Kull Feb 04 '15 at 18:37
  • I've tried to provide a comprehensive overview of options. However, unless I'm missing something, it seems that your best option is topic modeling and, thus, LDA (mentioned under model-based clustering/Dirichlet processes). I hope that my answer is helpful. – Aleksandr Blekh Feb 04 '15 at 20:24
  • Have you tried ELKI? It's the best tool for clustering IMHO. It has lots of fast algorithms you don't find everywhere, and it is usually both a lot faster and less memory-hungry than R. 15k observations is not that much; I've clustered 100k. – Has QUIT--Anony-Mousse Feb 04 '15 at 23:19
  • I edited my question with some more thoughts. I would like to stay in the scope of the original question. – Andres Kull Feb 05 '15 at 12:53

3 Answers


Depending on specifics, consider the following alternatives. I am sure that you're familiar with some, but maybe not all, of these methods. Additionally, some of the papers referenced below describe algorithm modifications that might be appropriate for your specific task and data sets.

  • K-means adaptations. For example, see this paper and this paper. Also, see this paper on using bootstrapping in K-means cluster analysis (while the paper is focused on speed, the space improvement is IMHO implied as well, due to the nature of the bootstrapping process).
  • Model-based clustering: mixture modeling. This is an interesting option, implemented in several R packages, most notably mclust (http://www.stat.washington.edu/mclust). The approach, methods and software are well presented in this vignette paper.
  • Model-based clustering: Dirichlet processes (DP). Another popular option is Bayesian Dirichlet mixture models and hierarchical DP. If I understand the material correctly, (probabilistic) topic modeling also fits this category and includes such approaches as latent Dirichlet allocation (LDA) (note: not to be confused with a different method with the same abbreviation, linear discriminant analysis, mostly used for dimensionality reduction, as I understand it). More information on LDA: this introductory paper, some other relevant publications and a very recent paper on a much improved LDA approach. See also the sketch after this list.
  • Hierarchical clustering (HC). In addition to traditional HC, you may find some interesting hybrid approaches, such as Dirichlet diffusion trees, which apply the HC approach to DP mixtures (see this paper; other partially related research can be found via links on this page).
  • Latent variable modeling (LVM)-based clustering. Clustering applications include latent class analysis (LCA)-based latent class clustering (LCC) (see this paper) and latent tree models, described here. For some discussion, including a comparison between LCA and K-means, see this page and this paper.
  • Information theory-based clustering. For example, see this paper.
  • Neural networks- and genetic algorithms-based clustering. For example, see this paper.
  • I think that I've also seen some papers on using entropy for classification, but I can't find them at the moment (will update if that changes).
  • Some other interesting/relevant papers: a comparison of LCC and PAM; clustering with Bregman divergences (probably belongs to the information theory-based clustering category).
  • Some relevant discussions on Cross Validated: here and here
  • For determining an optimal number of clusters, see this excellent answer on StackOverflow.
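Since your data are per-document term probability distributions, here is a hedged sketch of the topic-modeling route using the topicmodels package; `dtm` (a document-term count matrix) and `k = 10` are placeholders, not a recommendation:

    library(topicmodels)

    lda_fit <- LDA(dtm, k = 10, method = "Gibbs")  # fit a 10-topic LDA model
    doc_topics <- posterior(lda_fit)$topics        # per-document topic distributions
    clusters <- apply(doc_topics, 1, which.max)    # hard-assign each document to its dominant topic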
Aleksandr Blekh

I solved my R PAM clustering memory consumption issue by:

  1. increasing the AWS (Amazon Web Services) instance memory
  2. changing the JSD distance matrix elements from numeric to integer:

    # scale to keep three decimal places, then switch the storage mode in place;
    # plain as.integer() would drop the matrix dimensions
    jsd_dist_int <- round(as.matrix(jsd_dist) * 1000)
    storage.mode(jsd_dist_int) <- "integer"
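Integer elements take 4 bytes instead of the 8 bytes of doubles, so this roughly halves the matrix's footprint; scaling by 1000 keeps three decimal places, which is plenty, since JSD-based distances are bounded by 1.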
    
Andres Kull

This thread is a little old, but in the world of networks it is common practice to use connectivity-based (hierarchical) clustering, such as Ward's method, when dealing with pairwise distance measures. I have attached a link to an article that presents a procedure similar to the one you've described: after calculating the JSD matrix, they apply hierarchical clustering and then, as a third step, use the von Neumann entropy as a criterion for deciding the optimal place to cut the dendrogram.

https://arxiv.org/pdf/1405.0425.pdf
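As a rough sketch of that procedure in R (stopping short of the entropy-based cut; `jsd_dist` and `k = 5` are placeholders):

    # Ward linkage on the precomputed JSD distances; Ward strictly assumes
    # Euclidean distances, so applying it to JSD is a common but heuristic choice
    hc <- hclust(as.dist(jsd_dist), method = "ward.D2")
    clusters <- cutree(hc, k = 5)  # the linked paper instead picks the cut via von Neumann entropy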

Jacob