Simple way to cluster histograms

Question

I'm trying to cluster a set of histograms. Each histogram represents the frequency distribution of numbers from 1 to 5. The following figure shows two samples of my data.

[Figure: two sample histograms]

I have 10,000 histograms with a fixed number of bins (5), and I'm looking for a simple clustering algorithm implemented in MATLAB, C#, or C++ that can take the histograms and cluster them.

Take a look [here](http://arxiv.org/pdf/cs/0509033.pdf) and [here](http://link.springer.com/chapter/10.1007%2F978-3-540-73560-1_12). I couldn't find an ungated copy of the second article. — shadowtalker, May 15 '15 at 16:20
Unfortunately, the second article is worth a mint. :) Thanks for these links! — Michael Dorner, Jul 25 '15 at 10:02
I might try to use PCA to group them. It is 5-dimensional continuous data, and you are trying to pack it into discrete bins. — EngrStudent, Dec 23 '16 at 23:27

score 5 · Answer 1 · edited Sep 26 '18 at 04:31

Use hierarchical clustering or DBSCAN.

They have one huge benefit over k-means: they work with arbitrary distance measures. With histograms you might want to use a measure such as the Jensen-Shannon divergence, which is designed to capture the similarity of distributions.
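To make this concrete, here is a minimal C++ sketch (not part of the original answer) that computes the Jensen-Shannon divergence between two histograms and plugs it into a naive DBSCAN. The names `Histogram`, `normalize`, `klDivergence`, `jensenShannon`, and `dbscan` are illustrative, and the values you pick for `eps` and `minPts` are assumptions you would have to tune on your own data. The neighbour search is quadratic, so treat it as a starting point rather than a tuned implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

using Histogram = std::vector<double>;

// Scale raw bin counts so they sum to 1 (an all-zero histogram is returned unchanged).
Histogram normalize(const Histogram& h) {
    const double total = std::accumulate(h.begin(), h.end(), 0.0);
    Histogram p(h);
    if (total > 0.0) {
        for (double& v : p) v /= total;
    }
    return p;
}

// Kullback-Leibler divergence KL(p || m) in bits; bins with p[i] == 0 contribute 0.
double klDivergence(const Histogram& p, const Histogram& m) {
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        if (p[i] > 0.0) sum += p[i] * std::log2(p[i] / m[i]);
    }
    return sum;
}

// Jensen-Shannon divergence (base 2, so the result lies in [0, 1]).
double jensenShannon(const Histogram& a, const Histogram& b) {
    assert(a.size() == b.size());
    const Histogram p = normalize(a);
    const Histogram q = normalize(b);
    Histogram m(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) m[i] = 0.5 * (p[i] + q[i]);
    return 0.5 * klDivergence(p, m) + 0.5 * klDivergence(q, m);
}

// Naive O(n^2) DBSCAN using jensenShannon as the distance.
// Returns one label per histogram: 0, 1, 2, ... for clusters, -1 for noise.
std::vector<int> dbscan(const std::vector<Histogram>& data, double eps, std::size_t minPts) {
    const int UNVISITED = -2;
    const int NOISE = -1;
    const std::size_t n = data.size();
    std::vector<int> labels(n, UNVISITED);

    // All points within eps of point i (the point itself is included).
    auto neighbors = [&](std::size_t i) {
        std::vector<std::size_t> out;
        for (std::size_t j = 0; j < n; ++j) {
            if (jensenShannon(data[i], data[j]) <= eps) out.push_back(j);
        }
        return out;
    };

    int cluster = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (labels[i] != UNVISITED) continue;
        std::vector<std::size_t> seeds = neighbors(i);
        if (seeds.size() < minPts) { labels[i] = NOISE; continue; }
        labels[i] = cluster;
        // Grow the cluster; seeds may gain entries while we iterate by index.
        for (std::size_t s = 0; s < seeds.size(); ++s) {
            const std::size_t j = seeds[s];
            if (labels[j] == NOISE) labels[j] = cluster;  // border point
            if (labels[j] != UNVISITED) continue;
            labels[j] = cluster;
            const std::vector<std::size_t> more = neighbors(j);
            if (more.size() >= minPts) seeds.insert(seeds.end(), more.begin(), more.end());
        }
        ++cluster;
    }
    return labels;
}
```

Calling `dbscan(histograms, eps, minPts)` returns one label per input histogram. If you need a proper metric (for example for an index structure or some linkage criteria), the square root of the Jensen-Shannon divergence is one.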

answered May 15 '15 at 20:24 by Has QUIT--Anony-Mousse; edited Sep 26 '18 at 04:31 by DarkCygnus

score 1 · Answer 2 · edited Apr 13 '17 at 12:44

K-means could do this. K-means is an unsupervised clustering algorithm. Rewrite each histogram as a vector and use Euclidean distance.
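A minimal C++ sketch of this approach (Lloyd's algorithm on 5-bin vectors with squared Euclidean distance) could look like the following. It is an illustration, not a reference implementation: the names `Histogram`, `KMeansResult`, `squaredEuclidean`, and `kMeans` are made up for this example, and taking the first `k` histograms as initial centroids is a simplifying assumption to keep the result deterministic. In practice you would probably use random or k-means++ initialisation with several restarts.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Histogram = std::vector<double>;

struct KMeansResult {
    std::vector<std::size_t> labels;   // cluster index for each histogram
    std::vector<Histogram> centroids;  // mean histogram of each cluster
    double inertia = 0.0;              // sum of squared distances to assigned centroids
};

double squaredEuclidean(const Histogram& a, const Histogram& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Lloyd's algorithm. Assumption: the first k histograms serve as initial centroids.
KMeansResult kMeans(const std::vector<Histogram>& data, std::size_t k, int maxIterations = 100) {
    assert(k > 0 && k <= data.size());
    KMeansResult result;
    result.centroids.assign(data.begin(), data.begin() + static_cast<std::ptrdiff_t>(k));
    result.labels.assign(data.size(), 0);
    const std::size_t bins = data[0].size();

    for (int iter = 0; iter < maxIterations; ++iter) {
        bool changed = (iter == 0);  // force at least one centroid update
        // Assignment step: attach each histogram to its nearest centroid.
        for (std::size_t i = 0; i < data.size(); ++i) {
            std::size_t best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double d = squaredEuclidean(data[i], result.centroids[c]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            if (best != result.labels[i]) { result.labels[i] = best; changed = true; }
        }
        if (!changed) break;  // assignments are stable, so the centroids are too

        // Update step: move each centroid to the mean of its members.
        std::vector<Histogram> sums(k, Histogram(bins, 0.0));
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t i = 0; i < data.size(); ++i) {
            const std::size_t c = result.labels[i];
            for (std::size_t b = 0; b < bins; ++b) sums[c][b] += data[i][b];
            ++counts[c];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // an empty cluster keeps its old centroid
            for (std::size_t b = 0; b < bins; ++b) {
                result.centroids[c][b] = sums[c][b] / static_cast<double>(counts[c]);
            }
        }
    }

    for (std::size_t i = 0; i < data.size(); ++i) {
        result.inertia += squaredEuclidean(data[i], result.centroids[result.labels[i]]);
    }
    return result;
}
```

If the histograms are built from different numbers of observations, you may want to normalise each one to sum to 1 first, so that Euclidean distance compares shapes rather than totals.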

This post goes into the assumptions of K-means: "How to understand the drawbacks of K-means". You might want to check these.

You have to determine the number of clusters yourself, for example by fitting models with different values of k and comparing them.
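One common way to compare models is the "elbow" heuristic: record the within-cluster sum of squares (inertia) for k = 1, 2, 3, ... and stop where an extra cluster no longer buys much. The C++ sketch below is only one possible rule of thumb. The function name `pickKByElbow` and the threshold `minRelativeGain` are assumptions made for this example, not a standard API, and the inertia values have to come from your own k-means runs.

```cpp
#include <cstddef>
#include <vector>

// inertias[i] is the within-cluster sum of squares obtained with k = i + 1 clusters.
// Returns the smallest k for which going to k + 1 clusters reduces the inertia
// by less than minRelativeGain (hypothetical threshold, e.g. 0.2 for 20%).
// If every step still helps, the largest k that was tried is returned.
std::size_t pickKByElbow(const std::vector<double>& inertias, double minRelativeGain) {
    for (std::size_t i = 1; i < inertias.size(); ++i) {
        const double previous = inertias[i - 1];
        if (previous <= 0.0) return i;  // k = i already fits the data perfectly
        const double gain = (previous - inertias[i]) / previous;
        if (gain < minRelativeGain) return i;  // k = i + 1 adds too little
    }
    return inertias.size();
}
```

The threshold is arbitrary, so it is worth plotting the inertia curve and checking that the chosen k looks sensible for your data.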

answered May 15 '15 at 16:17 by spdrnl; edited Apr 13 '17 at 12:44 by Community
