5

Can I use Jaccard index to calculate similarity between set and multiset?

As I know Jaccard is defines as the size of the intersection divided by the size of the union of the sample sets,

that is $J(A, B) = |A \cap B| \, / \, |A \cup B|$

Now if I have a set $s$, $s = \{\text{special}, \text{words}\}$ and a multiset $m$, $m = \{\text{term}, \text{special}, \text{words}, \text{special}\}$

How can I use Jaccard index to take repetition into consideration?

Alexey Grigorev
  • 8,147
  • 3
  • 26
  • 39
Arwa
  • 151
  • 1
  • 4

2 Answers2

4

You can use Generalized Jaccard Index, and assume that the set $s$ is actually a multiset:

If $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ are two vectors with all real $x_i, y_i \geq 0$, then their Jaccard similarity coefficient is defined as $$J(\mathbf{x}, \mathbf{y}) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}.$$

Here you can read "vector" as "multiset", and $x_i$ is a count of element $i$ in the multiset $\mathbf x$.

Alexey Grigorev
  • 8,147
  • 3
  • 26
  • 39
  • but in this case shouldn't x and y have the same size "n" which is not my case? for me s and m have different size. – Arwa Jul 22 '15 at 14:43
  • Think of multisets as vectors, all of the same size, indexed by all possible words that you can have in your sets. Then index $i$ would go over all of them, and if both multisets don't have $i$ inside, it will contribute nothing to the total sum. – Alexey Grigorev Jul 22 '15 at 19:41
  • Okay, I think I got it. So, from the example I provided, "special" appears in the multiset "s" once and in "m" twice, "words" appears once in each. So, J(s, m) = 1 + 1 / 2 + 1 = 2/3. right? – Arwa Jul 22 '15 at 19:52
  • Yes, it's (1 + 1) / (2 + 1) – Alexey Grigorev Jul 23 '15 at 08:50
  • A question came into my mind, now we assume that "s" is a multiset, however, it is actually a set and elements inside it will always occur once. So, min(x,y) will always have the value one, will that not going to affect the final result? – Arwa Jul 26 '15 at 09:03
  • well since $s$ is a set, I believe it's fine – Alexey Grigorev Jul 26 '15 at 09:09
  • wouldn't it be ( 1 + 1 ) / ( 2 + 1 + 1 ) since it's a union in the denominator? – scottlittle Sep 30 '19 at 02:32
2

How do you want to take the repetition into account? I sew a few ways you could do it:

  1. You can just ignore it, the resulting index value would be $J(s,m)=2/3$ ;
  2. You can count the repeated words as if they were different words. Then it would be $J(s,m)=2/4$, now the union is 4 because 'special' appears twice in the union;
  3. You can count the repeated words with a different weight. For instance 0.5, $J(s,m)=2/3.5$, 'special' gets a weight of 1.5, 1 for the first appearance and 0.5 for the second.

But I would say it depends on what you are trying to do and how you are approaching your problem.

João Almeida
  • 233
  • 1
  • 6