0

In presentations of k-means to compute the centroid of a new cluster the Euclidean average seems to always be used. If the similarity metric used is not the Euclidean metric it seems to me this other metric should be used to determine the centroid.

So my questions are:

1) what results are there that indicate that using the Euclidean average with other similarity metrics still works well,

2) what are general methods to compute a centroid w.r.t different metrics,

3) in my particular situation the similarity metric that seems most applicable is the Jaccard similarity; do the answers change with that additional stipulation?

This is a specific instance of a more general question that I can't formulate yet, but it does seem that the Euclidean metric is used more often than fits with my intuition.

kasterma
  • 126
  • 4
  • Technically K-means algorithm doesn't deal with (dis)similarities at all. Implicitly it is based on the euclidean distance. The concept of "mean" or "centroid" implies euclidean space. The concept could be extended over to some other spaces of course, but this will mean to have to modify the algorithm greatly. It won't be "K-means" anymore. – ttnphns Feb 07 '13 at 16:31
  • 1
    @ttnphns this has been done, and it's called k-medoids. Indeed it is not k-means anymore, as you can't rely on the mean to minimize your objective function if it is an arbitrary distance function. – Has QUIT--Anony-Mousse Feb 08 '13 at 07:36

0 Answers0