Confidence interval for the number of trials before you've observed each outcome in the sample space?

Question

My friend collects stickers. There are 200 unique stickers to be sticked into the album. As the album gets fuller, it gets less and less likely that any newly obtained sticker is not already in the album.

I assumed every sticker to be obtained equally likely and figured the expected number of stickers necessary to be collected in order to completely fill the album, that is so that 200 unique stickers have been obtained, would be the sum of the expectation of 200 geometric random variables:

$E(n) = \sum_{x=1}^{200} \frac{200}{x} \approx 1176$

Is this correct?
I computed the variance, under the (in her case, very unrealistic) assumption of independence, as the sum of the variances of these random variables. Is this correct? $Var(n) = \sum_{x=1}^{200} \frac{1-\frac{x}{200}}{(x/200)^2} \approx 64422$ .
How would I give her a confidence interval on the number of stickers she'd most likely have to collect to make the album full?

PS: This isn't homework as I'm asking out of curiosity, but I found no appropriate tag and think it kind of looks a bit like homework.

#1 is answered at http://stats.stackexchange.com/questions/29364. But how did you determine that *geometric* variables are involved here? — whuber, Jun 26 '12 at 15:10
What do you mean when you ask for a confidence interval in #3? CIs are computed using observed data, which you don't seem to have here. — MånsT, Jun 26 '12 at 16:12
@whuber: For each number of stickers n she already has she keeps drawing new stickers until she gets one she doesn't already have. This seems like a geometric variable because it's the number of X Bernoulli trials needed to get one success. The probability P is fixed for each n. Each sticker represents a trial. The expected number of stickers needed per n should be the mean of a geometric distribution with parameter P. The expected number needed to fill the album should be the sum of the means of the 200 geometric distributions. No? — miura, Jun 27 '12 at 07:07
@MånsT: Right. What I mean is I want to be able to tell her: "You're most likely going to need X stickers, and I'm 95% sure it will be no less than Y and no more than Z". — miura, Jun 27 '12 at 07:11
@whuber: Thank you for the link. This also seems very related, or even the same problem: http://en.wikipedia.org/wiki/Coupon_collector%27s_problem — miura, Jun 27 '12 at 07:12
Yes, the Wikipedia article discusses the same problem. It falls a little short in presenting asymptotic formulas for the tail probabilities (the CDF and CCDF). You need the *inverses* of these functions. — whuber, Jun 27 '12 at 13:23
With apologies for the self-citations, here are a couple of related questions: (**1**) [Birthday-coverage problem](http://math.stackexchange.com/q/26772/7003) and (**2**) [What is a tight lower bound on the coupon collector time?](http://stats.stackexchange.com/q/7774/2970). The first asks a version of your same question and the second asks something related, with answers and comments providing the information necessary to construct bounds. The bounds are a little weak, but are nonasymptotic. — cardinal, Jul 05 '12 at 12:42

score 3 · Accepted Answer · answered Jul 05 '12 at 01:53

3

What you want is more of a prediction interval than a confidence interval (you are not estimating a population parameter).

Here is a simulation approach:

tmpfun <- function() {
    book <- numeric(200)
    while( any(book==0) ) {
        tmp <- sample(200,1)
        book[tmp] <- book[tmp] + 1
    }
    return(sum(book))
}

> out <- replicate( 100000, tmpfun() )
> mean(out)
[1] 1175.593
> var(out)
[1] 64697.3
> quantile(out, c(0.025, 0.975) )
 2.5% 97.5% 
  806  1792

answered Jul 05 '12 at 01:53

Greg Snow

46,563
2
90
159

It's beautiful. – miura Jul 05 '12 at 06:39
I am interested in something similar, but instead of requiring to see all the stickers I would like to see all the stickers with probability at least $1-\alpha$. How do you handle this case? – BigG Dec 18 '12 at 17:01
@BigG, what do you mean by seeing all the stickers with probability? This might be best as a whole new question (you can link to the one above for background) with more detail on what you are looking for. – Greg Snow Dec 19 '12 at 16:31

Confidence interval for the number of trials before you've observed each outcome in the sample space?

1 Answers1