
In the simplest case, given a set of N items, what is the distribution of the number of draws with replacement needed before all items have been seen? This is the case I really need.

More generally, what is the distribution of the number of draws from a multinomial distribution needed before all possible outcomes have been seen?

This looks somewhat like a generalization of the negative binomial distribution where the success probability is changing.

Daniel Mahler
    This is called the "coupon collector's problem". This question seems like a good start, although it may not be quite what you're looking for: https://stats.stackexchange.com/questions/547372/intuition-about-the-coupon-collector-problem-approaching-a-gumbel-distribution – David Luke Thiessen Jan 10 '22 at 01:29

1 Answer


If you sample using simple random sampling with replacement (SRSWR), i.e., every item is selected with the same probability on each draw, this is the coupon collector's problem. If we let $T$ denote the excess number of draws required to sample all items (i.e., the number of draws beyond the minimum of $N$ draws), then the mass function for this random variable is:

$$\text{CoupColl}(t \mid N) = \frac{N!}{N^{N+t}} \cdot S(N+t-1, N-1) \quad \quad \quad \text{for } t = 0,1,2,\ldots,$$

where $S(\cdot, \cdot)$ denotes the Stirling numbers of the second kind. This distribution is a special case of the "negative occupancy distribution", which is implemented in the occupancy package in R. (To get the coupon-collector distribution, set the occupancy parameter equal to the space parameter, which in this case is $N$.) Here is an example of the distribution for $N=10$.

#Load library
library(occupancy)

#Set parameters
N <- 10

#Compute mass function
PROBS <- dnegocc(0:100, space = N, occupancy = N)
names(PROBS) <- 0:100

#Plot the mass function
LABEL <- rep(NA, length(PROBS))
SEQ   <- seq(1, length(PROBS), by = 10)
LABEL[SEQ] <- names(PROBS)[SEQ]
barplot(PROBS, names.arg = LABEL, col = 'blue',
        xlab = 'Excess number of draws', ylab = 'Probability')

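[Figure: bar plot of the mass function of the excess number of draws for $N=10$; x-axis: excess number of draws, y-axis: probability]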
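If you would rather not depend on the package, the mass function can also be computed directly from the Stirling-number formula. The following is only a minimal base-R sketch: `coupon_pmf` is a hypothetical helper name (it is not part of the occupancy package), and it builds the Stirling numbers with the standard recurrence $S(n,k) = k \, S(n-1,k) + S(n-1,k-1)$. It stores the Stirling numbers as ordinary doubles, so it is only suitable for moderate values of $N$ and $t$.

```r
# Hypothetical helper (not from the 'occupancy' package): mass function of the
# excess number of draws T for t = 0, ..., t_max, with N equally likely items
coupon_pmf <- function(t_max, N) {
  k     <- N - 1                 # second argument of S(., N - 1)
  n_max <- N + t_max - 1         # largest first argument of S(., N - 1) needed

  # S[i + 1, j + 1] holds S(i, j), the Stirling numbers of the second kind
  S <- matrix(0, nrow = n_max + 1, ncol = k + 1)
  S[1, 1] <- 1                   # S(0, 0) = 1
  for (i in seq_len(n_max)) {
    for (j in seq_len(min(i, k))) {
      # Recurrence: S(i, j) = j * S(i - 1, j) + S(i - 1, j - 1)
      S[i + 1, j + 1] <- j * S[i, j + 1] + S[i, j]
    }
  }

  t <- 0:t_max
  # Combine the terms on the log scale so that N^(N + t) does not overflow
  log_p <- lfactorial(N) - (N + t) * log(N) + log(S[N + t, k + 1])
  exp(log_p)
}

# Example with the same settings as above
N <- 10
p <- coupon_pmf(100, N)
sum(p)   # total mass on t = 0, ..., 100; should be just below 1
```

Assuming the package behaves as described above, these values should agree with `dnegocc(0:100, space = N, occupancy = N)` up to floating-point error, which gives a quick cross-check of the formula.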

Ben