
It seems to me that the clustering methods $k$-means, dbscan, and hierarchical clustering all work on distance measures $d$ that are (pseudo)metrics, i.e., fulfill the following requirements: $$ d(x,x)=0 $$ $$ d(x,y) = d(y,x) $$ $$ d(x,z) \leqslant d(x,y) + d(y,z) $$

I am wondering whether these algorithms also work with distance measures between two data points that do not fulfill those requirements, for example because they violate the triangle inequality?

  • Lewian has given a good answer. Just to add on the topic: About k-means, look [here](https://stats.stackexchange.com/q/81481/3277). About hierarchical linkages, look at one of the paragraphs [here](https://stats.stackexchange.com/a/217742/3277). – ttnphns Aug 23 '20 at 10:40

1 Answer


$k$-means in its standard form uses the Euclidean distance. This is necessary because otherwise the centroids that optimally represent the clusters would not be means, and the name $k$-means wouldn't be justified. Unfortunately, these days a number of authors use the term $k$-means for something more general involving other types of distances, but that is a misleading use of terminology.
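A minimal sketch (using NumPy, which is not mentioned in the answer) of why the mean is the right centroid under squared Euclidean distance: among all candidate centres, the coordinate-wise mean minimizes the within-cluster sum of squared distances, which is exactly the quantity $k$-means reduces in each step.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 2))  # points assigned to one cluster

def wcss(points, centre):
    """Within-cluster sum of squared Euclidean distances to a given centre."""
    return np.sum((points - centre) ** 2)

mean_centre = cluster.mean(axis=0)
print(wcss(cluster, mean_centre))                  # smallest achievable value
print(wcss(cluster, mean_centre + 0.5))            # any shifted centre is worse
print(wcss(cluster, np.median(cluster, axis=0)))   # so is e.g. the coordinate-wise median
```

Swapping in another dissimilarity breaks this property, which is why the mean is then no longer the optimal cluster representative.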

In principle you can use properly distance-based methods such as dbscan and single, average, or complete linkage hierarchical clustering (but not Ward's method, which like $k$-means relies on the Euclidean distance) with general dissimilarities that do not fulfil the triangle inequality. Whether this is appropriate depends on the specific situation and dissimilarity. I suspect that dbscan may produce results that are hard to interpret, because it is based on neighbourhoods, and the concept of a neighbourhood becomes weird if it is possible that $d(x,z)=100$ but $d(x,y)=d(y,z)=0.1$, so that $x$ and $z$ are both in a close neighbourhood of $y$ but are extremely far from each other. That said, some dissimilarities that violate the triangle inequality only do so in rather mild ways.
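A minimal sketch, assuming scikit-learn and SciPy (neither is named in the answer), of running such methods on a user-supplied dissimilarity matrix. The matrix `D` below is made up for illustration and deliberately violates the triangle inequality, mimicking the $d(x,z)=100$, $d(x,y)=d(y,z)=0.1$ situation described above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Symmetric dissimilarity matrix with zero diagonal; D[0, 2] > D[0, 1] + D[1, 2],
# so the triangle inequality is violated.
D = np.array([
    [  0.0,  0.1, 100.0, 90.0],
    [  0.1,  0.0,   0.1, 95.0],
    [100.0,  0.1,   0.0,  0.2],
    [ 90.0, 95.0,   0.2,  0.0],
])

# DBSCAN accepts arbitrary precomputed dissimilarities via metric='precomputed'.
db = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit(D)
print("DBSCAN labels:", db.labels_)

# Average-linkage hierarchical clustering on the same matrix
# (squareform converts the square symmetric matrix to condensed form).
Z = linkage(squareform(D), method="average")
print("2-cluster cut:", fcluster(Z, t=2, criterion="maxclust"))
```

With these numbers dbscan chains all four points into a single cluster even though the dissimilarity between the first and third point is 100, which illustrates the interpretability concern above.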

Christian Hennig