9

Let us say that we define a distance, which is not a metric, between N items.

Based on this distance we then use an Agglomerative hierarchical clustering.

Can we use each of the known algorithm (single/maximum/avaerage linkage etc), to get meaningful results? Or put differently, what is the problem with using them if the distance is not a metric?

Raffael
  • 1,424
  • 3
  • 18
  • 30
Tal Galili
  • 19,935
  • 32
  • 133
  • 195
  • What are "items" in your case? (I'm asking whether it has anything to do with psychometrics because if this is the case, I would recommend having a look at [item clustering](http://www.personality-project.org/r/html/ICLUST.html), or Revelle, W. [Hierarchical cluster analysis and tihe internal structure of tests](http://personality-project.org/revelle/publications/iclust.pdf), MBR (1979) 14:57.) – chl Sep 19 '11 at 19:47

2 Answers2

7

Requirements for distances depend on method of hierarchical clustering. Single, complete, average methods need distances to be no-negative and symmetric. Ward, centroid, median methods need (squared) euclidean (which is even narrower definition than metric) distances to produce geometrically meaningful results.

(One can check if his/her distance matrix is euclidean by doubly centering it [see my reply here] and looking at the eigenvalues; if no negative eigenvalues found then the distances do converge in euclidean space.)

ttnphns
  • 51,648
  • 40
  • 253
  • 462
  • Thanks. Further question: does the triangle inequality has to hold for single, complete, average methods? and if some distance is (for example) not symmetric, what problem does it pose to these methods? (Thanks!) – Tal Galili Aug 05 '11 at 20:07
  • 1
    Classical hierarchical clustering methods _can take in_ nothing but symmetrical matrix: a distance from A to B = from B to A. Special other methods exist to deal with asymmetrical (you may google). As for triangular inequality - it is not necessary condition for the methods you mention. (However, common wisdom thinks of "distance" as smth with the inequality, so it is worth considering to impose it if it's missing. To do it, iteratively add small constant to the distances and check. And if you continue to add upon reaching it then you will soon arrive at euclidean distances) – ttnphns Aug 06 '11 at 06:43
5

No, the distance doesn't have to be a metric. It can, for instance, be an ultrametric: $$d(A, B) \le \max(d(A, C), d(B, C))$$

Ultrametric distances obtained from successive steps in the clustering algorithm can be represented using dendrograms, which you may have seen in this context.

Hong Ooi
  • 7,629
  • 3
  • 29
  • 52
  • Thank you Hong. I remember that methods to transform some objects to hclust demand that the dendrogram is ultrametric - I wounder if this has to do with what you wrote. In any event, thank you for the answer. – Tal Galili Aug 05 '11 at 20:08