
This question is similar to the one asking for the expected number of unique elements drawn from a uniform distribution, but with the difference that I am interested in calculating (or at least approximating) the variance.

Background:

Assume we have $n$ bins and $k$ sampling trials with replacement. Each bin has probability $\frac{1}{n}$ of being sampled on any trial. Let $X$ be the number of bins sampled at least once, so that $P(X=x)$ is the probability that exactly $x$ bins have been sampled at least once. Each time a bin is sampled, a ball is placed in it, and the process continues until all $k$ trials have been performed. A common task is to find the expected number of unique bins ever visited after $k$ trials. Usually people solve this problem with indicator random variables.

Assume $I_j = 1$ if the $j$-th bin is picked at least once in the $k$ trials and $0$ otherwise. Since $X = \sum_j I_j$ and $E[I_j] = 1 - \left(\frac{n-1}{n}\right)^k$, the expected total number is trivially $E[X] = E\left[\sum_j I_j\right] = n\left(1 - \left(\frac{n-1}{n}\right)^k\right)$.
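As a quick sanity check on this expectation (not part of the original derivation), here is a minimal Monte Carlo sketch; the function name and the values $n=20$, $k=15$ are just illustrative choices.

```python
import random

def simulate_unique_bins(n, k, trials=100_000, seed=0):
    """Estimate E[X], the expected number of distinct bins hit in k uniform draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(n) for _ in range(k)})
    return total / trials

n, k = 20, 15
closed_form = n * (1 - ((n - 1) / n) ** k)
print(simulate_unique_bins(n, k))  # Monte Carlo estimate
print(closed_form)                 # n(1 - ((n-1)/n)^k)
```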

The question

What is $Var[X]$, though? More generally, what is the PMF $P(X=x)$?

Past approaches

At first I assumed that the $I_j$ were i.i.d., so that one could simply write $Var\left[\sum_j I_j\right] = \sum_j Var[I_j]$, but in my opinion they are not independent. If, for example, $k = 1$ and a certain $I_j$ is $1$, then all the other $I_{j'}$ with $j' \neq j$ must be $0$.

Then I tried a more direct computation via $Var(X) = E[X^2] - E[X]^2$. What is $E[X^2]$, then?

By doing some algebra, I found that $E[X^2] = E\left[\sum_j I_j^2 + 2\sum_j \sum_{j' > j} I_j I_{j'}\right]$. Clearly, by linearity of expectation one may try to compute those terms separately. For example, $E\left[\sum_j I_j^2\right] = E\left[\sum_j I_j\right] = E[X]$ because $I_j$ can only be $0$ or $1$.
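Spelling out the symmetry step that is implicit here: since all pairs $(j, j')$ with $j \neq j'$ are exchangeable,

$$E[X^2] = E[X] + 2\binom{n}{2} E[I_1 I_2] = E[X] + n(n-1)\,P(I_1 = 1, I_2 = 1),$$

which is the form used in the variance expression further down.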

But now I am quite unsure about what to do with the last term. In all my naïveté I tried to break this product into the following 4 cases:

  1. $I_j = 0, I_{j'} = 0$. Because the two variables are not independent, I write $P(I_j=0, I_{j'}=0) = P(I_j=0 \mid I_{j'}=0)\,P(I_{j'}=0)$. The second factor should be $\left(\frac{n-1}{n}\right)^k$ and the first $\left(\frac{n-2}{n-1}\right)^k$, because the $j'$-th bin is never selected while the (failing) trials are still $k$ in number. Thus $P(I_j=0, I_{j'}=0) = \frac{((n-1)(n-2))^k}{n^{2k}}$

  2. $I_j=0, I_{j'} = 1$. With a similar line of reasoning as above, $P(I_{j'}=1) = 1 - \left(\frac{n-1}{n}\right)^k$ and $P(I_j=0 \mid I_{j'}=1) = \left(\frac{n-1}{n-2}\right)^{k-1}$. This time the exponent drops to $k-1$ because we know for sure one trial was successful. Thus $P(I_j=0, I_{j'}=1) = \left(\frac{n-1}{n-2}\right)^{k-1}\left(1 - \left(\frac{n-1}{n}\right)^k\right)$

  3. $I_j=1, I_{j'} = 0$. By a similar line of reasoning to case 2, I conclude that $P(I_j=1, I_{j'}=0) = \left(1 - \left(\frac{n-1}{n-2}\right)^k\right)\left(\frac{n-1}{n}\right)^k$

  4. $I_j = 1, I_{j'} = 1$. This is ultimately the case we are looking for, as it is the only one with a non-null product. Its probability should be $1$ minus the sum of all the other cases, i.e. $P(I_j=1, I_{j'}=1) = 1 - \left(\frac{((n-1)(n-2))^k}{n^{2k}} + \left(\frac{n-1}{n-2}\right)^{k-1}\left(1 - \left(\frac{n-1}{n}\right)^k\right) + \left(1 - \left(\frac{n-1}{n-2}\right)^k\right)\left(\frac{n-1}{n}\right)^k\right)$. Here is a wolfram link if you are interested.

Therefore one could plug this back into the variance formula above and obtain

$Var(X) = E[X] + n(n-1)P(I_j=1,I_{j'}=1) - E[X]^2$

But this does not seem to be correct. Am I missing an edge case, or am I applying some law incorrectly?
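For comparison (an addition, not part of the original attempt), one can estimate $P(I_j=1, I_{j'}=1)$ and $Var(X)$ by simulation and check them against the expressions above. A minimal sketch, with illustrative values $n=20$, $k=15$ and a function name of my own choosing:

```python
import random

def estimate_joint_and_var(n, k, trials=200_000, seed=0):
    """Monte Carlo estimates of P(I_0=1, I_1=1) and Var(X) for n bins and k draws."""
    rng = random.Random(seed)
    both = 0
    s = 0.0
    s2 = 0.0
    for _ in range(trials):
        hit = {rng.randrange(n) for _ in range(k)}
        x = len(hit)
        s += x
        s2 += x * x
        if 0 in hit and 1 in hit:   # bins j and j' both sampled at least once
            both += 1
    mean = s / trials
    return both / trials, s2 / trials - mean * mean

print(estimate_joint_and_var(n=20, k=15))
```

Comparing these estimates with the candidate joint probability and the resulting variance makes any discrepancy visible numerically.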

What I have not tried (yet)

Trying to solve for $P(X \geq x)$ (or its complement) and defining $P(X=x)$ as $P(X \geq x) - P(X \geq x + 1)$. But I do not directly see whether this would help me find the variance.
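If that route were pursued, one alternative way (not via $P(X \geq x)$, and purely a sketch of mine) to obtain $P(X=x)$ exactly is a recursion over trials: after each draw the number of occupied bins either stays the same or grows by one. The variance then follows directly from the PMF. The function name below is my own.

```python
def occupancy_pmf(n, k):
    """Exact PMF of X = number of distinct bins hit after k uniform draws into n bins."""
    pmf = [0.0] * (n + 1)
    pmf[0] = 1.0                                # before any draw, zero bins are occupied
    for _ in range(k):
        new = [0.0] * (n + 1)
        for x, p in enumerate(pmf):
            if p == 0.0:
                continue
            new[x] += p * x / n                 # draw lands in an already-occupied bin
            if x < n:
                new[x + 1] += p * (n - x) / n   # draw opens a new bin
        pmf = new
    return pmf

pmf = occupancy_pmf(20, 15)
mean = sum(x * p for x, p in enumerate(pmf))
var = sum(x * x * p for x, p in enumerate(pmf)) - mean ** 2
print(mean, var)
```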

1 Answer


I thought these were well known for the number of bins sampled at least once (a small numeric check of all three is sketched after the list):

  • $E[X] = n- n\left(1-\frac1n\right)^k$ - equal to what you wrote

  • $\text{Var}[X] = n\left(1-\frac1n\right)^k + n^2\left(1-\frac1n\right)\left(1-\frac2n\right)^k- n^2\left(1-\frac1n\right)^{2k}$

  • $P(X=x) = \dfrac{n!\, S_2(k,x)}{(n-x)!\, n^k}$ where $S_2(k,x)$ is a Stirling number of the second kind
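A minimal sketch checking these three formulas against each other; the helper names and the values $n=20$, $k=15$ are illustrative assumptions of mine, not part of the answer.

```python
from fractions import Fraction
from math import perm

def stirling2(k, x):
    """Stirling number of the second kind S_2(k, x), via the standard recurrence."""
    table = [[0] * (x + 1) for _ in range(k + 1)]
    table[0][0] = 1
    for i in range(1, k + 1):
        for j in range(1, x + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[k][x]

def pmf(n, k, x):
    """P(X = x) = n! S_2(k, x) / ((n - x)! n^k), the classical occupancy PMF."""
    return Fraction(perm(n, x) * stirling2(k, x), n ** k)

n, k = 20, 15
mean = sum(x * pmf(n, k, x) for x in range(n + 1))
var = sum(x * x * pmf(n, k, x) for x in range(n + 1)) - mean ** 2
print(float(mean), float(var))
# compare with the closed forms above:
print(n - n * (1 - 1 / n) ** k)
print(n * (1 - 1 / n) ** k + n * n * (1 - 1 / n) * (1 - 2 / n) ** k - n * n * (1 - 1 / n) ** (2 * k))
```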

Henry