8

Exercise 3.1.3 : Suppose we have a universal set U of n elements, and we choose two subsets S and T at random, each with m of the n elements.

What is the expected value of the Jaccard similarity of S and T ?

I am reading the book http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

Soroush
  • 105
  • 5
delete me
  • 91
  • 1
  • 5

5 Answers5

7

The above answer assumes that an element in $T$ may be repeated several times in $S$ (i.e. $S$ and $T$ are not sets but multisets); else the probability will not be $m/n$ uniformly.

I expect the answer should be more along the following lines:-

Let the number of common elements between $S$ and $T$ be $k$. Then, as mentioned by ack_inc in the comment to his answer, Jaccard similarity $Sim(S,T)=k/(2m-k)$.

Now, $Pr(Sim(S,T)=k/(2m-k))$ will be $\dfrac{{m\choose {k}} {n-m\choose m-k}}{n\choose m}$ since there are $n$ total elements, of which $m$ are in $S$ and $k$ are common. So the number of ways we can choose $m$ elements for $T$ is given by ${m \choose k}$ (choosing the $k$ common elements from $S$) times ${n-m\choose m-k}$ (choosing remaining $m-k$ elements).

Thus, $E(Sim(S,T))=\sum_{k=0}^{m} \dfrac{k}{2m-k} \dfrac{{m\choose {k}} {n-m\choose m-k}}{n\choose m}$.

However, simplifying the above expression is beyond my limited knowledge of combinatorial identities. If anyone can do so, kindly update the answer.

krypto07
  • 171
  • 1
  • 2
  • When size of $T$ is same as the size of $U$ (each of size `n`), then wny not $S\subset T$? – CKM Apr 06 '20 at 06:18
4

I'm posting an alternative solution.

Jaccard similarity of two sets $S$ and $T$ is defined as the fraction of elements these two sets have in common, i.e. $\text{sim}(S,T)=|S\cap T|/|S\cup T|$. Suppose we chose $m$-element subsets $S$ and $T$ uniformly at random from an $n$-element set. What is the expected Jaccard similarity of these two sets? Suppose the $|S\cap T|=k$ for some $0\le k\le m$. Notice that for the first set, $S$, we have $\binom{n}{m}$ choices, while for $T$ we have $\binom{m}{k}\binom{n-m}{m-k}$ choices, because $k$ elements must be from $S$ and $m-k$ elements must not be from $S$. This gives us $$\Pr[|S\cap T|=k]=\frac{\binom{m}{k}\binom{n-m}{m-k}}{\binom{n}{m}},$$ meaning that $$\text{E}[\text{sim}(S,T)]=\sum_{k=0}^m\frac{\binom{m}{k}\binom{n-m}{m-k}}{\binom{n}{m}}\frac{k}{2m-k}.$$ Even though $\text{E}[|S\cap T|/|S\cup T|]\neq\text{E}[|S\cap T|]/\text{E}[|S\cup T|]=m/(2n-m)$, this expression seems to give good approximation.

Thanks to Mitja Trampus for pointing out an alternate solution, with $$\Pr[|S\cap T|=k]=\binom{m}{k}\frac{\binom{m}{k}}{\binom{n}{k}}\frac{\binom{n-m}{m-k}}{\binom{n}{m}},$$ giving the following expression: $$\text{E}[\text{sim}(S,T)]=\sum_{k=0}^m\binom{m}{k}\frac{\binom{m}{k}}{\binom{n}{k}}\frac{\binom{n-m}{m-k}}{\binom{n}{m}}\frac{k}{2m-k}.$$

(The above expressions are, of course, equivalent.)

EDIT: Regarding the simplification, perhaps applying the following identity (from Aigner's book, page 13) could work: $$\binom{n}{m}\binom{m}{k}=\binom{n}{k}\binom{n-k}{m-k}.$$

user12344567
  • 141
  • 1
  • 6
  • There is a problem. What exactly do you do with Pr [S union T]? How do you go from E [ S int. T / S union T] to the second equation? – user13107 Apr 21 '14 at 08:23
  • Um, which equation exactly? Once I know the size of the intersection is $k$, I know the size of the union must be $2m-k$. So I view the size of the intersection $|S\cap T|$ as a random variable. Once I compute $\Pr[|S\cap T|=k]$, I just use the definition of the expectation: $E[sim(S,T)]=\Pr[|S\cap T|=1]/(2m-1)+\ldots+\Pr[|S\cap T|=m]$. Does this answer your question? – user12344567 Apr 21 '14 at 09:32
3

Each item in T has an $\frac{m}{n}$ chance of also being in S. The expected number of items common to S & T is therefore $\frac{m^2}{n}$.

Exp. $\text{Jaccard Similarity} = \dfrac{\text{No. of common items}}{\text{Size of T} + \text{Size of S} - \text{Number of common items}} = \dfrac{m}{2n - m}$ (after simplification.)

Soroush
  • 105
  • 5
ack_inc
  • 162
  • 3
  • The expected number of items common to S & T is therefore m^2 / n How?+ – delete me Dec 16 '13 at 11:41
  • Expected value is calculated as Sigma x.p(x), summing over each of the m elements in set T. Each common item you find adds 1 to the common item total, so in our case, x = 1 and p(x) is m / n. – ack_inc Dec 16 '13 at 12:16
  • In my opinion, this is wrong or at least, not sufficiently explained. Reasoning that E[X/Y] = E[X] / E[Y] is not valid in general. – Antoine Mar 25 '16 at 15:50
  • I believe this is wrong. The detailed solution given here is very informative: http://www.l3s.de/~anand/lsdm16/lectures/homework1-simitems-sultions.pdf – Joseph Jan 21 '18 at 01:44
  • @Joseph The link is dead – Antoine Feb 25 '19 at 14:14
  • 1
    Incorrect. Counter example: m=2, n=3. You can compute by hand that the correct answer is 5/9, this formula yields 1/2. As pointed out by Antoine, the E[X/Y] = E[X] / E[Y] derivation is incorrect. – Valentin Waeselynck Apr 09 '19 at 19:15
  • The link from the comment is dead, here is a [link to an archive](https://web.archive.org/web/20171011072312/http://www.l3s.de/~anand/lsdm16/lectures/homework1-simitems-sultions.pdf) with a rigorous solution. – Alex Nikiforov Jul 05 '19 at 05:53
  • According to the Example 3.2 in the book, the size of the union should always the sum of the sizes of the two sets. So I think the denominator should be 2m. Numerator is m^2/n. So SIM= m/2n. – Elaine Dec 21 '15 at 00:19
0

I agree with blazs answer - just want to add a small correction (credit to another guy in the course who pointed it out).

The summation does not start at 0. (you'll see it if you make n=100 and m=99)

$$ \text{E}[\text{sim}(S,T)]=\sum_{k=max(0, 2m-n)}^m\binom{m}{k}\frac{\binom{m}{k}}{\binom{n}{k}}\frac{\binom{n-m}{m-k}}{\binom{n}{m}}\frac{k}{2m-k}. $$

GM1313
  • 1
  • 1
0

I just want to add the following:

As pointed out, ack_inc's answer is not correct and can serve only as an approximation. Also, the lower bound should be $\max\{0, 2m - n\}$ instead of $0$, as GM1313 mentions.

I needed to compute the similarities for $1\leq m\leq n$ where $n = 5000$ or even bigger, so computing all the probabilities $P_{n, m}[|S\cap T| = k]$ (blindly following the definition)

  • takes a lot of time,
  • gives wrong results due to floating-point arithmetics, e.g., $p = 0.0$.

Therefore, I used the fact that ${a \choose b + 1} = \frac{a - b}{b + 1}{a \choose b}$ to speed up the process, and compute the probabilities recursively, starting with the biggest one.

I also used the approximation $m / (2n - m)$ and actually, it is really good, especially for larger values of $n$ (curves for $n\in\{20, 100, 5000\}$):

Comparison for n = 20 Comparison for n = 100 Comparison for n = 5000

Antoine
  • 738
  • 5
  • 15