3

I want to measure the similarity between two hierarchical trees generated from the same n objects. The two trees are generated using the same metric at two different instants. Thus, I would like to measure how much the tree is changed. My question if there are some measures to compute this similarity? and If there are some functions in Matlab to do this comparison?

bassir
  • 148
  • 10
  • My answer below shows one of possible ways to compare trees which are hierarchical classifications on objects. Such as cluster dendrograms or cart trees. Not all trees are such classifications, though. – ttnphns Aug 12 '15 at 16:25

2 Answers2

3

There are many algorithms that compute tree-edit distances, i.e. the minimal number of atomic actions needed to transform one tree into another. In your case (since the number of objects is constant and the trees are hierarchical) it would be the number of matrix permutations. That's quite similar to sequence edit distances and alignment scores. I'm not aware of any implementations in Matlab, though I've seen some implementations in Python, R and JS. If you need some support-statistics for your estimates, you can use bootstrapping or jackknifing.

Update

One particular tree-edit distance algorithm was introduced by Zhang et. al (1989). Some implementations:

There is also a paper by Pawlik et. al (2011) with another algorithm and a speed comparison between several alternatives, including the one mentioned above.

Eli Korvigo
  • 771
  • 1
  • 6
  • 16
  • Would you care to elaborate how the edit distance of two dendrograms, where the number of objects are fixed and trees are hierarchical can be solved using matrix permutations. I have also looked around, but the implementations are mainly for string edit distances. – Amir May 13 '18 at 13:11
  • @Amir Sure, I've updated the answer. – Eli Korvigo May 13 '18 at 13:40
  • Thanks for the response. I think the Zhang et al. algorithm is for ordered labelled trees. It's a bit different with dendrograms where non-terminal nodes are unlabelled, and there is no ordering defined on subtrees. I think an unordered tree edit distance is more suitable, but I believe the general problem of unordered tree edit distance is NP-hard. – Amir May 13 '18 at 13:59
1

A similar question partly concerns with it. In short, compute correlation coefficient between the following two variables (and cases are all possible pairs of objects): (V1) how much close the two objects are on one tree, (V2) ... are on the other tree. Closeness value is the level on which the two respective clusters containing these two objects merge. As the "level", you might use the distance between the merging clusters (called colligation coefficient) or the clustering step number, or rank of the distance. Such sort of correlation between two hierarchical classifications or dendrograms is called (in biology mostly) cophenetic correlation.

For more information and nuances please search cophenetic correlation on this site and the internet. One particularly nice article is this one: James S. Farris. On the Cophenetic Correlation Coefficient.

ttnphns
  • 51,648
  • 40
  • 253
  • 462