
I'm trying to cluster a set of histograms. Each histogram represents the frequency distribution of the numbers from 1 to 5. The following figure shows two samples of my data.

[Figure: two sample histograms over the values 1 to 5]

I have 10,000 histograms with a fixed number of bins (5), and I'm looking for a simple clustering algorithm, implemented in MATLAB, C#, or C++, that can take the histograms and cluster them.

gung - Reinstate Monica
Omar14
  • Take a look [here](http://arxiv.org/pdf/cs/0509033.pdf) and [here](http://link.springer.com/chapter/10.1007%2F978-3-540-73560-1_12). I couldn't find an ungated copy of the second article. – shadowtalker May 15 '15 at 16:20
  • Unfortunately, the second article is worth a mint. :) Thanks for these links! – Michael Dorner Jul 25 '15 at 10:02
  • I might try to use PCA to group them. It is 5-dimensional continuous data, and you are trying to pack it into discrete bins. – EngrStudent Dec 23 '16 at 23:27

2 Answers


Use hierarchical clustering or DBSCAN.

They have one huge benefit over k-means: they work with arbitrary distance measures. With histograms, you might want a measure designed to capture the similarity of distributions, for example the Jensen-Shannon divergence.
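
To make this concrete, here is a minimal sketch in plain Python (chosen for brevity; the same steps port to MATLAB or C++). It assumes each histogram is a list of 5 raw bin counts, uses the base-2 Jensen-Shannon divergence as the dissimilarity, and runs a naive average-linkage agglomerative clustering. The function names (`normalize`, `js_divergence`, `average_linkage`) and the toy data are made up for illustration. The merge loop is far too slow for 10,000 histograms, so for the real data a library implementation of hierarchical clustering or DBSCAN that accepts a precomputed distance matrix would be the practical route.

```python
import math

def normalize(hist):
    """Turn raw bin counts into a probability distribution."""
    total = sum(hist)
    return [h / total for h in hist]

def js_divergence(p, q):
    """Jensen-Shannon divergence in base 2, so the result lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute 0 by convention.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def average_linkage(hists, n_clusters):
    """Naive agglomerative clustering; returns lists of histogram indices."""
    probs = [normalize(h) for h in hists]
    n = len(probs)
    # Full pairwise dissimilarity matrix (illustrative; O(n^2) memory).
    dist = [[js_divergence(probs[i], probs[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average dissimilarity over all cross-cluster pairs.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy data (hypothetical): two left-heavy and two right-heavy histograms.
hists = [
    [50, 30, 10, 5, 5],
    [45, 35, 10, 5, 5],
    [5, 5, 10, 30, 50],
    [5, 10, 10, 25, 50],
]
clusters = average_linkage(hists, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

On this toy input the two left-heavy histograms end up in one cluster and the two right-heavy ones in the other.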

DarkCygnus
Has QUIT--Anony-Mousse

K-means could do this. It is an unsupervised clustering algorithm: rewrite each histogram as a 5-dimensional vector of bin frequencies and use Euclidean distance.

This post goes into the assumptions of K-means: "How to understand the drawbacks of K-means". You might want to check them against your data.

You have to determine the number of clusters yourself by fitting models with different values of k and comparing them.
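
As a rough sketch of this approach in plain Python (again for brevity; it ports directly to MATLAB or C++): the code below normalizes each histogram to a 5-dimensional vector, runs Lloyd's algorithm with squared Euclidean distance, and repeats it for several values of k so that the within-cluster sum of squares (WCSS) can be compared. The deterministic farthest-point initialization is an assumption made here only for reproducibility; random restarts or k-means++ are the more common choices. All names and the toy data are hypothetical.

```python
def normalize(hist):
    """Turn raw bin counts into a vector of relative frequencies."""
    total = sum(hist)
    return [h / total for h in hist]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, WCSS)."""
    # Assumed initialization: start from the first point, then repeatedly
    # add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, wcss

# Toy data (hypothetical): three left-heavy and three right-heavy histograms.
hists = [
    [50, 30, 10, 5, 5],
    [45, 35, 10, 5, 5],
    [55, 25, 10, 5, 5],
    [5, 5, 10, 30, 50],
    [5, 10, 10, 25, 50],
    [5, 5, 15, 30, 45],
]
points = [normalize(h) for h in hists]

# Fit models with different k and record the labels and WCSS of each.
results = {}
for k in range(1, 4):
    labels, _, wcss = kmeans(points, k)
    results[k] = (labels, wcss)
    print(k, labels, wcss)
```

WCSS tends to keep shrinking as k grows, so the usual heuristic is to look for the k after which the improvement levels off (the "elbow") rather than to pick the minimum.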

spdrnl