
I have many paths that come from the same graph, and I am trying to cluster them. My first thought was to simply use the Levenshtein distance. The problem is that two very short paths with no nodes in common can have a smaller distance than a short path and a much longer path that contains it (e.g., A-B-C, X-Y-Z, A-B-C-D-E-F-G-H-I-J-K-L).
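To make the problem concrete, here is a quick check with a plain dynamic-programming Levenshtein implementation (my own sketch, not taken from any particular library), treating each node label as one character:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Two short, completely disjoint paths:
print(levenshtein('ABC', 'XYZ'))           # 3
# A short path versus a long path that contains it:
print(levenshtein('ABC', 'ABCDEFGHIJKL'))  # 9
```

So the disjoint pair ends up "closer" (3) than the path and its own extension (9), which is exactly the wrong ordering for my purposes.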

I would like paths to cluster together when they have nodes / waypoints in common. Also, some nodes may be more important than others.
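For illustration, one simple way to encode both requirements (a hypothetical sketch, not a standard named metric) would be a weighted Jaccard distance over the sets of visited nodes, with a hand-assigned importance weight per node:

```python
def weighted_jaccard_distance(path_a, path_b, weights):
    # Weighted Jaccard distance on the sets of visited nodes.
    # `weights` maps a node to its importance; unknown nodes default to 1.
    a, b = set(path_a), set(path_b)
    inter = sum(weights.get(n, 1.0) for n in a & b)
    union = sum(weights.get(n, 1.0) for n in a | b)
    return 1 - inter / union

w = {'A': 2.0, 'B': 1.0, 'C': 1.0}  # hypothetical importances
print(weighted_jaccard_distance('ABC', 'ABCDEFGHIJKL', w))  # ~0.69
print(weighted_jaccard_distance('ABC', 'XYZ', w))           # 1.0
```

Here the containing path is closer to the short path (~0.69) than the disjoint one (1.0), which is the ordering I want; but this ignores node order entirely, which is why I am asking about proper metrics.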

I am not very familiar with standard distance metrics for this kind of case. What would be a good distance metric? Do you have some good resources for me?

gung - Reinstate Monica
nordpol

3 Answers


One way to cluster the paths could be to represent them as strings such as

path1 = 'ABC' and

path2 = 'ABDC'.

Then string similarity measures such as LCSS (longest common subsequence) can be used. This website shows an example calculator, including the source code, which you can copy and paste. Although this would probably yield good results for clustering, LCSS does not by itself satisfy the metric requirements: the LCSS length of a path compared with itself is the full path length, not zero, so it is a similarity that must first be converted into a distance.
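A minimal sketch of the idea (my own toy implementation; the normalization into a distance is just one common choice, not the only one):

```python
def lcss_len(a: str, b: str) -> int:
    # Length of the longest common subsequence, dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

path1, path2 = 'ABC', 'ABDC'
sim = lcss_len(path1, path2)   # 3, the subsequence 'ABC'
# One way to turn the similarity into a distance in [0, 1]:
dist = 1 - sim / max(len(path1), len(path2))
print(sim, dist)               # 3 0.25
```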

There are also other string similarity measures that might fit even better.

Which one fits best, only you can tell, by trying them all and comparing the clustering results.

Nikolas Rieble

I would suggest the complexity- and information-theoretic approach developed by Andreas Brandmaier: permutation distribution clustering (PDC). The method is well developed in several papers and has an R package for implementation (e.g., here ... https://www.jstatsoft.org/article/view/v067i05/v067i05.pdf). Basically, it involves computing a matrix of pairwise dissimilarities and then feeding that matrix into a clustering algorithm of the analyst's choice. It's a flexible method and can be made to work with data as messy as yours sounds.
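The PDC package itself is in R, but the second step is generic: any precomputed pairwise dissimilarity matrix can go into a clusterer. A sketch with SciPy's hierarchical clustering, using a made-up 4×4 distance matrix standing in for whatever PDC (or any other measure) produces:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical pairwise distances between 4 paths: the first two
# are close to each other, the last two are close to each other.
D = np.array([[0.0, 0.1, 0.9, 0.8],
              [0.1, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.2],
              [0.8, 0.9, 0.2, 0.0]])

# linkage expects the condensed (upper-triangular) form.
Z = linkage(squareform(D), method='average')
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # the first two paths share a label, the last two share the other
```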

Mike Hunter

Rather than clustering, you should be looking for

frequent common subsequences

as this is more tolerant to noise in your data.
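As a toy illustration of the idea (restricted to contiguous subpaths for brevity; full sequential-pattern mining, e.g. PrefixSpan, handles gaps as well), one could count how many paths share each subpath and keep the frequent ones:

```python
from collections import Counter

def frequent_subpaths(paths, min_len=2, min_count=2):
    # Count, for every contiguous subpath of at least `min_len` nodes,
    # how many paths contain it; keep those reaching `min_count`.
    counts = Counter()
    for p in paths:
        seen = set()
        for i in range(len(p)):
            for j in range(i + min_len, len(p) + 1):
                seen.add(p[i:j])
        counts.update(seen)  # each subpath counted once per path
    return {s: c for s, c in counts.items() if c >= min_count}

paths = ['ABC', 'XYZ', 'ABCDEFGHIJKL']
print(frequent_subpaths(paths))
# 'AB', 'BC' and 'ABC' each occur in 2 of the 3 paths; everything else is rare.
```

The noisy singleton subpaths drop out automatically, which is the tolerance to noise mentioned above.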

Has QUIT--Anony-Mousse