
I have a dataset with a large number of features and about 15 000 observations. I'm clustering the observations using a probability-distribution distance metric related to the Jensen-Shannon divergence (JSD), calculated as described at http://enterotype.embl.de/enterotypes.html. I'm applying the R implementation of Partitioning Around Medoids (PAM) clustering to my JSD distance matrix.
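For concreteness, here is a minimal sketch of that pipeline; `probs` (a documents × terms matrix whose rows are per-document probability distributions), the `jsd()` helper and `k = 3` are my own placeholders, and, like the enterotype tutorial, it clusters on the square root of JSD, which is a proper metric:

    library(cluster)  # provides pam()

    # Jensen-Shannon divergence between two probability vectors,
    # using the 0 * log(0) = 0 convention for absent terms
    jsd <- function(p, q) {
      m <- (p + q) / 2
      kl <- function(a, b) sum(ifelse(a == 0, 0, a * log(a / b)))
      (kl(p, m) + kl(q, m)) / 2
    }

    n <- nrow(probs)
    d <- matrix(0, n, n)
    for (i in seq_len(n - 1)) {
      for (j in (i + 1):n) {
        d[i, j] <- d[j, i] <- sqrt(jsd(probs[i, ], probs[j, ]))
      }
    }

    fit <- pam(as.dist(d), k = 3, diss = TRUE)  # PAM on the precomputed distances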

The issue is that the distance matrix seems to be too big and eats all the memory: a dense 15 000 × 15 000 double-precision matrix alone takes about 1.8 GB, before any copies R makes during the computation. I'm looking for alternative implementations. k-means doesn't work with distance metrics other than Euclidean, R's clara works only with Euclidean and Manhattan distances, and Spark does not support PAM or CLARA yet.

UPDATE 5.2.2015: Thanks, Aleksandr, for the thorough collection of links! I would actually like to stay in the scope of my question. I'm interested in the similarities between my documents. The JSD distances express the document similarities very well and I'm happy with them, so I would like to continue from there. For each document, I need to figure out which other documents are the most similar ones that are relevant from a business point of view.

There are two overly naive approaches: order the other documents by JSD distance from each document and either take the first n, or take all those below some JSD threshold, as the similar ones.
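In code, those two selections would look something like this sketch, where `d` is the full JSD distance matrix and `n` and `threshold` are placeholders:

    neighbours <- setdiff(order(d[i, ]), i)              # documents nearest to document i first
    top_n      <- neighbours[seq_len(n)]                 # approach 1: the n nearest documents
    below_thr  <- setdiff(which(d[i, ] < threshold), i)  # approach 2: everything within the threshold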

The better approach is to let a clustering algorithm form the clusters, because both naive approaches either cut off too many genuinely similar documents or also pick up documents that are actually not similar.

So my question is: which clustering algorithm is the most reasonable to use, given that I already have a very good JSD distance matrix to feed in?

Andres Kull
  • Are you limited to a specific type (or a set of types) of clustering, or can you consider alternatives? – Aleksandr Blekh Feb 04 '15 at 17:56
  • All good alternatives are welcome! – Andres Kull Feb 04 '15 at 17:57
  • One more clarification: is your data mostly continuous, mostly categorical or mixed? – Aleksandr Blekh Feb 04 '15 at 18:05
  • Document-term data, where term frequency distributions are converted to probability distributions per document. – Andres Kull Feb 04 '15 at 18:37
  • I've tried to provide a comprehensive overview of options. However, unless I'm missing something, it seems that your best option is topic modeling and, thus, LDA (mentioned under model-based clustering/Dirichlet processes). I hope that my answer is helpful. – Aleksandr Blekh Feb 04 '15 at 20:24
  • Have you tried ELKI? It's the best tool for clustering IMHO. It has lots of fast algorithms you don't find everywhere, and it is usually both a lot faster and less memory-hungry than R. 15k observations is not that much; I've clustered 100k. – Has QUIT--Anony-Mousse Feb 04 '15 at 23:19
  • I edited my question with some more thoughts. I would like to stay in the scope of the original question. – Andres Kull Feb 05 '15 at 12:53

3 Answers


Depending on specifics, consider the following alternatives. I am sure that you're familiar with some, but maybe not all, of these methods. Additionally, some of the papers referenced below describe algorithm modifications that might be appropriate for your specific task and data sets.

  • K-means adaptations. For example, see this paper and this paper. Also, see this paper on using bootstrapping in K-means cluster analysis (while the paper is focused on speed, the space improvement is IMHO implied as well, due to the nature of the bootstrapping process).
  • Model-based clustering: mixture modeling. This is an interesting option, implemented in several R packages, most notably mclust (http://www.stat.washington.edu/mclust). The approach, methods and software are well presented in this vignette paper.
  • Model-based clustering: Dirichlet processes (DP). Another popular option is Bayesian Dirichlet mixture models and hierarchical DP. If I understand the material correctly, (probabilistic) topic modeling also fits this category and includes such approaches as latent Dirichlet allocation (LDA) (note: not to be confused with a different method with the same abbreviation, linear discriminant analysis, mostly used for dimensionality reduction, as I understand it). More information on LDA: this introductory paper, some other relevant publications and a very recent paper on a much improved LDA approach. See also the sketch after this list.
  • Hierarchical clustering (HC). In addition to traditional HC, you may find some interesting hybrid approaches, such as Dirichlet diffusion trees, which apply the HC approach to DP mixtures (see this paper; other partially related research can be found via links on this page).
  • Latent variable modeling (LVM)-based clustering. Clustering applications include latent class analysis (LCA)-based latent class clustering (LCC) (see this paper) and latent tree models, described here. For some discussion, including a comparison between LCA and K-means, see this page and this paper.
  • Information theory-based clustering. For example, see this paper.
  • Neural networks- and genetic algorithms-based clustering. For example, see this paper.
  • I think that I've also seen some papers on using entropy for classification, but I can't find them at the moment (will update if that changes).
  • Some other interesting/relevant papers: a comparison of LCC and PAM; clustering with Bregman divergences (probably belongs to the information theory-based clustering category).
  • Some relevant discussions on Cross Validated: here and here
  • For determining an optimal number of clusters, see this excellent answer on StackOverflow.
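Since your data are per-document term probability distributions, here is a hedged sketch of the topic-modeling route using the topicmodels package; `dtm` (a document-term count matrix) and `k = 10` are placeholders, not a recommendation:

    library(topicmodels)

    lda_fit <- LDA(dtm, k = 10, method = "Gibbs")  # fit a 10-topic LDA model
    doc_topics <- posterior(lda_fit)$topics        # per-document topic distributions
    clusters <- apply(doc_topics, 1, which.max)    # hard-assign each document to its dominant topic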
Aleksandr Blekh

I solved my R PAM clustering memory consumption issue by:

  1. increasing the AWS (Amazon Web Services) instance memory
  2. changing the JSD distance matrix elements from numeric to integer:

    # scale to keep three decimal places, then switch the storage mode in place;
    # plain as.integer() would drop the matrix dimensions
    jsd_dist_int <- round(as.matrix(jsd_dist) * 1000)
    storage.mode(jsd_dist_int) <- "integer"
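Integer elements take 4 bytes instead of the 8 bytes of doubles, so this roughly halves the matrix's footprint; scaling by 1000 keeps three decimal places, which is plenty, since JSD-based distances are bounded by 1.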
    
Andres Kull

This thread is a little old, but in the world of networks it is common practice to use connectivity-based (hierarchical) clustering, such as Ward's method, when dealing with pairwise distance measures. I have attached a link to an article that presents a procedure similar to the one you've described: after calculating the JSD matrix, they apply hierarchical clustering and then, as a third step, use the von Neumann entropy as a criterion for deciding the optimal place to cut the dendrogram.

https://arxiv.org/pdf/1405.0425.pdf
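As a rough sketch of that procedure in R (stopping short of the entropy-based cut; `jsd_dist` and `k = 5` are placeholders):

    # Ward linkage on the precomputed JSD distances; Ward strictly assumes
    # Euclidean distances, so applying it to JSD is a common but heuristic choice
    hc <- hclust(as.dist(jsd_dist), method = "ward.D2")
    clusters <- cutree(hc, k = 5)  # the linked paper instead picks the cut via von Neumann entropy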

Jacob