
I have many paths that come from the same graph, and I am trying to cluster them. My first thought was to simply use the Levenshtein distance. The problem is that two very short paths with no nodes in common can have a smaller distance than a short path and a much longer path that contains it (e.g., A-B-C, X-Y-Z, A-B-C-D-E-F-G-H-I-J-K-L).
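To make the problem concrete, here is a quick check with a plain dynamic-programming Levenshtein implementation (my own sketch, not taken from any particular library), treating each node label as one character:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Two short, completely disjoint paths:
print(levenshtein('ABC', 'XYZ'))           # 3
# A short path versus a long path that contains it:
print(levenshtein('ABC', 'ABCDEFGHIJKL'))  # 9
```

So the disjoint pair ends up "closer" (3) than the path and its own extension (9), which is exactly the wrong ordering for my purposes.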

I would like paths to cluster together when they have nodes / waypoints in common. Also, some nodes may be more important than others.
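For illustration, one simple way to encode both requirements (a hypothetical sketch, not a standard named metric) would be a weighted Jaccard distance over the sets of visited nodes, with a hand-assigned importance weight per node:

```python
def weighted_jaccard_distance(path_a, path_b, weights):
    # Weighted Jaccard distance on the sets of visited nodes.
    # `weights` maps a node to its importance; unknown nodes default to 1.
    a, b = set(path_a), set(path_b)
    inter = sum(weights.get(n, 1.0) for n in a & b)
    union = sum(weights.get(n, 1.0) for n in a | b)
    return 1 - inter / union

w = {'A': 2.0, 'B': 1.0, 'C': 1.0}  # hypothetical importances
print(weighted_jaccard_distance('ABC', 'ABCDEFGHIJKL', w))  # ~0.69
print(weighted_jaccard_distance('ABC', 'XYZ', w))           # 1.0
```

Here the containing path is closer to the short path (~0.69) than the disjoint one (1.0), which is the ordering I want; but this ignores node order entirely, which is why I am asking about proper metrics.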

I am not very familiar with standard distance metrics for this kind of case. What would be a good distance metric? Do you have some good resources for me?

gung - Reinstate Monica
nordpol

3 Answers


One way to cluster the paths could be to represent them as strings such as

path1 = 'ABC' and

path2 = 'ABDC'.

Then string similarity measures such as LCSS (longest common subsequence) can be used. This website shows an example calculator, including the source code, which you can copy and paste. Although this would probably yield good results for clustering, LCSS does not by itself satisfy the metric requirements: the LCSS length of a path compared with itself is the full path length, not zero, so it is a similarity that must first be converted into a distance.
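A minimal sketch of the idea (my own toy implementation; the normalization into a distance is just one common choice, not the only one):

```python
def lcss_len(a: str, b: str) -> int:
    # Length of the longest common subsequence, dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

path1, path2 = 'ABC', 'ABDC'
sim = lcss_len(path1, path2)   # 3, the subsequence 'ABC'
# One way to turn the similarity into a distance in [0, 1]:
dist = 1 - sim / max(len(path1), len(path2))
print(sim, dist)               # 3 0.25
```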

There are also other string similarity measures that might fit even better.

Which one fits best, only you can tell, by trying them all and comparing the clustering results.

Nikolas Rieble

I would suggest the complexity- and information-theoretic approach developed by Andreas Brandmaier: permutation distribution clustering (PDC). The method is well developed in several papers and has an R package for implementation (e.g., here ... https://www.jstatsoft.org/article/view/v067i05/v067i05.pdf). Basically, it involves computing a matrix of pairwise dissimilarities and then feeding that matrix into a clustering algorithm of the analyst's choice. It's a flexible method and can be made to work with data as messy as yours sounds.
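The PDC package itself is in R, but the second step is generic: any precomputed pairwise dissimilarity matrix can go into a clusterer. A sketch with SciPy's hierarchical clustering, using a made-up 4×4 distance matrix standing in for whatever PDC (or any other measure) produces:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical pairwise distances between 4 paths: the first two
# are close to each other, the last two are close to each other.
D = np.array([[0.0, 0.1, 0.9, 0.8],
              [0.1, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.2],
              [0.8, 0.9, 0.2, 0.0]])

# linkage expects the condensed (upper-triangular) form.
Z = linkage(squareform(D), method='average')
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # the first two paths share a label, the last two share the other
```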

Mike Hunter

Rather than clustering, you should be looking for

frequent common subsequences

as this is more tolerant to noise in your data.
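As a toy illustration of the idea (restricted to contiguous subpaths for brevity; full sequential-pattern mining, e.g. PrefixSpan, handles gaps as well), one could count how many paths share each subpath and keep the frequent ones:

```python
from collections import Counter

def frequent_subpaths(paths, min_len=2, min_count=2):
    # Count, for every contiguous subpath of at least `min_len` nodes,
    # how many paths contain it; keep those reaching `min_count`.
    counts = Counter()
    for p in paths:
        seen = set()
        for i in range(len(p)):
            for j in range(i + min_len, len(p) + 1):
                seen.add(p[i:j])
        counts.update(seen)  # each subpath counted once per path
    return {s: c for s, c in counts.items() if c >= min_count}

paths = ['ABC', 'XYZ', 'ABCDEFGHIJKL']
print(frequent_subpaths(paths))
# 'AB', 'BC' and 'ABC' each occur in 2 of the 3 paths; everything else is rare.
```

The noisy singleton subpaths drop out automatically, which is the tolerance to noise mentioned above.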

Has QUIT--Anony-Mousse