
I have an n-by-m array where every column sums to 1; in other words, I have m probability vectors of size n. I would like to cluster them into several categories.

I would appreciate it if somebody could point me to a robust method that can be used for this purpose.

The crucial point here is that these are probability vectors, so I am reluctant to use anything based on Euclidean distance.

Roger Vadim
  • In addition to @Anony's answer, look at https://stats.stackexchange.com/q/173636/3277, where I show the usage of the chi-sq distance for clustering count (or probability) data. – ttnphns Nov 08 '17 at 08:28

2 Answers


Appropriate distance functions for probability distributions include:

  • Histogram Intersection distance
  • Chi² distance
  • Jensen–Shannon divergence (a symmetrized variant of the Kullback–Leibler divergence)
  • Hellinger distance

There may be some more in ELKI. These are the ones I have played around with.
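For concreteness, here is a minimal NumPy/SciPy sketch of how a few of these distances could be computed for two probability vectors; the function names and the epsilon guard are my own choices, not part of any particular library.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon  # square root of the Jensen-Shannon divergence

def hellinger(p, q):
    # Hellinger distance between two probability vectors
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def chi2_distance(p, q, eps=1e-12):
    # chi-squared distance; eps guards against division by zero for empty bins
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def histogram_intersection_distance(p, q):
    # one minus the histogram intersection similarity
    return 1.0 - np.sum(np.minimum(p, q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])
print(hellinger(p, q), chi2_distance(p, q),
      histogram_intersection_distance(p, q), jensenshannon(p, q))
```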

Has QUIT--Anony-Mousse
  • I upvoted, but please be aware that the question is about "vector of probabilities data", i.e. simply about compositional data, and not about "probability distributions". By adding some options for the latter you may mislead the OP's expectations. Your answer is more general and less focused than it is expected, I think. – ttnphns Nov 08 '17 at 08:33
  • This seems to me a rather generic answer. To reformulate the question: I need a distance measure AND/OR an algorithm that would allow clustering while preserving the probability normalization (i.e. clustering against meaningful probability vectors). I've been considering the symmetrized Kullback–Leibler divergence, but I have doubts whether it is usable with K-means or EM, since most tutorials on this subject lean heavily towards mixtures of Gaussian distributions. – Roger Vadim Nov 08 '17 at 09:44
  • Use HAC, DBSCAN, PAM, or any *distance*-based method. K-means and EM are almost the only algorithms that will not work, because they won't allow the aforementioned distances (K-means should be used with squared Euclidean only, and EM uses the Mahalanobis distance). – Has QUIT--Anony-Mousse Nov 08 '17 at 18:09
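To make the comment above concrete, here is a rough sketch of distance-based clustering on precomputed Jensen–Shannon distances, using SciPy's hierarchical clustering and scikit-learn's DBSCAN. The toy data, the cluster count of 3, and the eps value are placeholders for illustration, not recommendations.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, jensenshannon
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import DBSCAN

# toy n-by-m array whose columns are probability vectors (n = 5, m = 20)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=20).T

X = P.T                                      # one probability vector per row
condensed = pdist(X, metric=jensenshannon)   # pairwise Jensen-Shannon distances

# hierarchical agglomerative clustering (HAC), cut into 3 clusters
Z = linkage(condensed, method="average")
hac_labels = fcluster(Z, t=3, criterion="maxclust")

# DBSCAN accepts the same distances as a precomputed square matrix
D = squareform(condensed)
db_labels = DBSCAN(eps=0.3, min_samples=3, metric="precomputed").fit_predict(D)
```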

Each of your n-dimensional vectors lies on a bounded region of an (n-1)-dimensional hyperplane (the probability simplex). A simple example is in two dimensions, where $x+y=1$ and $0 \le x,y \le 1$. One possible metric would be the cosine similarity:

$\text{similarity} = \cos(\theta) = \frac{\mathbf{x}\cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}$

where $\theta$ is the angle between the vectors. This is a similarity measure, which is 1 if $\mathbf{x}$ and $\mathbf{y}$ are equal. For a distance, use $\theta = \arccos(\text{similarity})$.
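A minimal sketch of that angular distance, assuming the two probability vectors are plain NumPy arrays (the clipping is just a guard against floating-point rounding):

```python
import numpy as np

def angular_distance(p, q):
    # cosine similarity between the vectors, then the angle itself as a distance
    cos_sim = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return np.arccos(np.clip(cos_sim, -1.0, 1.0))

print(angular_distance(np.array([0.2, 0.8]), np.array([0.5, 0.5])))
```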

I don't think there is anything wrong with using the Euclidean distance in this case, though. I would try both and see how the results compare. If your data set is not too large, I'd recommend density-based clustering such as DBSCAN.

user3433489
  • Thanks, this seems like a promising idea. My vectors are estimates for parameters of random processes; they are probabilities by their nature. Right now I have m independent models and estimate (n-1)m parameters. I believe, however, that some of these models are identical or very close to each other, i.e. there are really (n-1)k parameters, k – Roger Vadim Nov 07 '17 at 16:46
  • Cosine on probability distributions does not make a lot of sense. The vectors are normalized to sum 1, and now cosine renormalizes them to sum-of-squares 1, ruining the "probability distribution" property. – Has QUIT--Anony-Mousse Nov 08 '17 at 07:53