Similarity between sets with different size

Question

Is there a distance measure like Jaccard for sets with different sizes? For example, A=['a','b','c'] and B=['a','d'].

I would like to include the total intersection as well as the order.

The Jaccard similarity score implemented in Python's scikit-learn only supports inputs of the same shape.

The mathematical answer to such a question is "of course" and it would go on to point out there is an infinite variety of possibilities. But that begs the statistical context: what is this "similarity" supposed to measure? You need to tell us that in order to get anything that might truly be useful to you. — whuber, Jan 29 '16 at 14:32
I edited my post. I would like to calculate a score which takes the intersection and the order into account. — J-H, Jan 29 '16 at 15:02
Thank you. But the question is still vague and still has too many possible, drastically different solutions. What statistical problem are you trying to solve? — whuber, Jan 29 '16 at 15:07
Thanks for your answer. I want to measure the quality of a classification. The result is a list containing different categories for each predicted sample, like B=[a,b,c], C=[a,b], D=[c]. I want to compare these sets to my ground truth set A=[a,b,c]. Therefore B should return the "highest" value and D the lowest, because D has only one element in common with A, and that element is also at the wrong position (D=[c,none,none]). — J-H, Jan 29 '16 at 15:16
How do you get a prediction with only one element, 'D=[c]', while the ground truth 'A=[a,b,c]' has three elements? So somehow not only the classes are determined but also the number of classes (and the position?), but how and why? Is the position relevant? For 'sets', order and position are not defined, so the order is irrelevant, but maybe you mean a 'sequence'. In the case of a sequence, how do you deal with positions? Is the prediction D stating that the first element is 'c', which is incorrect, or is it stating that the last element is 'c', which is correct? — Sextus Empiricus, Nov 21 '21 at 10:36

luchonacho · Answer 1 · 2021-10-19T16:35:04.253

The Jaccard coefficient does allow for sets of different sizes, but its interpretation becomes less intuitive.

Here is an implementation in R:

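jaccard <- function(a, b) {
  # intersect() and union() treat the inputs as sets, so duplicates count once
  intersection <- length(intersect(a, b))
  union_size <- length(union(a, b))
  intersection / union_size
}

a <- c("a", "b", "c")
b <- c("a", "d")
jaccard(a, b)  # 0.25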

And here is one for Python:

def jaccard(list1, list2):
    # Convert to sets so duplicate elements are not double-counted
    set1, set2 = set(list1), set(list2)
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union

a = ['a', 'b', 'c']
b = ['a', 'd']
jaccard(a, b)  # 0.25

(code from here)

Another option is the Sørensen–Dice coefficient. The numerator is twice the size of the intersection, and the denominator is the sum of the cardinalities of both sets: 2|A ∩ B| / (|A| + |B|). To apply it, just change the two code snippets above accordingly.
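
As a minimal sketch of that change in Python (the function name dice and the handling of two empty inputs are my own assumptions, and inputs are converted to sets so duplicates count once):

```python
def dice(list1, list2):
    # Treat both inputs as sets so repeated elements count once
    set1, set2 = set(list1), set(list2)
    if not set1 and not set2:
        # Assumption: two empty sets are treated as identical
        return 1.0
    # Twice the intersection size over the sum of both set sizes
    return 2 * len(set1 & set2) / (len(set1) + len(set2))

a = ['a', 'b', 'c']
b = ['a', 'd']
print(dice(a, b))  # 0.4
```

For the example in the question, the intersection has one element and the set sizes are 3 and 2, so the result is 2 * 1 / 5.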

The overlap coefficient, or Szymkiewicz–Simpson coefficient, is an alternative that does not care about different set sizes: it divides the size of the intersection by the size of the smaller set, |A ∩ B| / min(|A|, |B|). As long as one set is contained in the other, the coefficient is one. To apply it, just change the two code snippets above accordingly.

Disregarding the size of the larger set might not be ideal, though. Since the index only measures the proportion of the smaller set that is contained in the larger one, its value is the same regardless of the size of the latter (10, 1000, 1000000).
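
To make this concrete, here is a rough Python sketch of the overlap coefficient (intersection size divided by the size of the smaller set) next to the Jaccard coefficient. The function names, the example sets, and the choice to return 0 for an empty input are assumptions made for illustration. The small set is always fully contained in the large one, and only the size of the large set changes:

```python
def overlap(list1, list2):
    set1, set2 = set(list1), set(list2)
    smaller = min(len(set1), len(set2))
    if smaller == 0:
        # Assumption: the coefficient is undefined for an empty set; return 0 here
        return 0.0
    return len(set1 & set2) / smaller

def jaccard(list1, list2):
    # Assumes at least one input is non-empty
    set1, set2 = set(list1), set(list2)
    return len(set1 & set2) / len(set1 | set2)

small = range(5)
for n in (10, 1000, 1000000):
    large = range(n)
    print(n, overlap(small, large), jaccard(small, large))
# 10 1.0 0.5
# 1000 1.0 0.005
# 1000000 1.0 5e-06
```

The overlap coefficient stays at one in all three cases, while the Jaccard coefficient shrinks as the larger set grows.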

I'm sure there are many more metrics. The comment by @whuber is correct: which one is appropriate depends on what you are after.
