17

There is a web service where I can request information about a random item. For every request each item has an equal chance of being returned.

I can keep requesting items and record the number of duplicate and unique items returned. How can I use these data to estimate the total number of items?

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
hoju
  • 270
  • 2
  • 7
  • 2
    What you want to estimate is not a sample size, but the size of a population (the total number of unique items returned by the web service). – GaBorgulya Apr 06 '11 at 02:34

3 Answers

10

This is essentially a variant of the coupon collector's problem.

If there are $n$ items in total and you have taken a sample of size $s$ with replacement, then the probability of having identified $u$ unique items is $$ Pr(U=u|n,s) = \frac{S_2(s,u)\, n! }{ (n-u)!\, n^s }$$ where $S_2(s,u)$ denotes the Stirling number of the second kind.

Now all you need is a prior distribution for $Pr(N=n)$: apply Bayes' theorem and you get a posterior distribution for $N$.
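For illustration (this sketch is not from the original answer), the likelihood above can be evaluated on a grid of candidate $n$ without forming the huge Stirling numbers, by running a recurrence over the $s$ draws, and then combined with a prior; the grid limits and the flat prior below are arbitrary choices:

likelihood_u = function(n, s, u) {
  # p[k+1] = Pr(k unique items seen so far); update it over the s draws
  p = c(1, rep(0, u))
  k = 0:u
  for (draw in 1:s) {
    p_new = p * k / n                                           # the draw repeats an item already seen
    p_new[-1] = p_new[-1] + p[-(u + 1)] * (n - k[-(u + 1)]) / n # the draw is a new item
    p = p_new                                                   # paths exceeding u unique items are dropped
  }
  p[u + 1]                                                      # Pr(U = u | n, s)
}

s = 300; u = 265                # the example discussed further down this page
n_grid = u:3000                 # arbitrary grid of candidate population sizes
post = sapply(n_grid, likelihood_u, s = s, u = u)  # flat prior: posterior proportional to likelihood
post = post / sum(post)
n_grid[which.max(post)]         # posterior mode
sum(n_grid * post)              # posterior mean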

Henry
  • 30,848
  • 1
  • 63
  • 107
  • This appears to lose some information because it does not account for the *frequencies* with which items were observed 2, 3, 4, ... times. – whuber Apr 06 '11 at 15:33
  • 3
    @whuber: It may appear not to use the information, but if you investigate further you should find that the number of unique items is a sufficient statistic. For example, if you take a sample with replacement of 4 items from a population of $n$, the probability of getting 3 of one item and 1 of another is $\frac{4}{3}$ that of getting 2 each of two items, no matter what $n$ is, so knowing the detailed frequencies gives no more useful information about the population than simply knowing there were two unique items found in the sample. – Henry Apr 06 '11 at 16:44
  • 1
    Interesting point about the sufficiency of the number of unique items. So the frequencies can serve as a check on the assumptions (of independence and equal probability), but otherwise are unnecessary. – whuber Apr 06 '11 at 18:06
4

I have already given a suggestion based on Stirling numbers of the second kind and Bayesian methods.

For those who find Stirling numbers too large or Bayesian methods too difficult, a rougher method might be to use

$$E[U|n,s] = n\left( 1- \left(1-\frac{1}{n}\right)^s\right)$$

$$var[U|n,s] = n\left(1-\frac{1}{n}\right)^s + n^2 \left(1-\frac{1}{n}\right)\left(1-\frac{2}{n}\right)^s - n^2\left(1-\frac{1}{n}\right)^{2s} $$

and back-calculate using numerical methods.

For example, taking GaBorgulya's example with $s=300$ and an observed $U = 265$, this might give us an estimate of $\hat{n} \approx 1180$ for the population.

If that had been the population then it would have given us a variance for $U$ of about 25, and an arbitrary two standard deviations either side of 265 would be about 255 and 275 (as I said, this is a rough method). 255 would have given us an estimate for $n$ of about 895, while 275 would have given about 1692. The example's 1000 is comfortably within this interval.
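A sketch of this back-calculation in R (the search interval handed to uniroot is an arbitrary choice, and the comments just restate the rough figures above):

expected_u = function(n, s) n * (1 - (1 - 1/n)^s)
var_u = function(n, s) n * (1 - 1/n)^s +
  n^2 * (1 - 1/n) * (1 - 2/n)^s - n^2 * (1 - 1/n)^(2 * s)

s = 300; u_obs = 265
n_hat = uniroot(function(n) expected_u(n, s) - u_obs, c(u_obs, 1e6))$root  # about 1180
sqrt(var_u(n_hat, s))                                                      # sd of U at n_hat, about 5

# invert again at 265 +/- 2 sd to get the rough interval
sapply(c(u_obs - 10, u_obs + 10), function(u)
  uniroot(function(n) expected_u(n, s) - u, c(u, 1e6))$root)               # about 895 and 1692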

Henry
  • 30,848
  • 1
  • 63
  • 107
  • 1
    (+1) It's interesting to note that if the ratio $s/n$ is very small then essentially no information about $n$ can be present and so one can't expect to do very well estimating $n$. If $s/n$ is very large then $U$ is a good estimator. So, we need something that works in a mid-range. – cardinal Apr 07 '11 at 13:21
  • Also $1 - (1-1/n)^s \geq (f_k(s/n)-1) / f_k(s/n)$ where $f_k(x) = \sum_{i=0}^k x^i/i!$ is the $k$th-order Taylor series approximation to $e^x$. Using $k=1$ gives an estimator $\tilde{n} = \frac{s}{s-U} U$. A continuity correction for small $s$ can be obtained by adding a constant (like 1) in the denominator. This estimator doesn't do as well for the example as numerically solving for $\hat{n}$ as you've done, though. – cardinal Apr 07 '11 at 13:22
3

You can use the capture-recapture method, implemented for example in the Rcapture R package.


Here is an example, coded in R. Let's assume that the web service has N=1000 items. We will make n=300 requests, generate a random sample, and renumber the observed elements from 1 to k, where k is the number of different items we saw.
N = 1000; population = 1:N # create a population of the integers from 1 to 1000
n = 300 # number of requests
set.seed(20110406)
observation = as.numeric(factor(sample(population, size=n,
  replace=TRUE))) # a random sample from the population, renumbered
table(observation) # a table useful to see, not discussed
k = length(unique(observation)) # number of unique items seen
(t = table(table(observation)))

The result of the simulation is:

  1   2   3 
234  27   4 

Thus among the 300 requests there were 4 items seen 3 times, 27 items seen twice, and 234 items seen only once.

Now estimate N from this sample:

require(Rcapture)
X = data.frame(t)                        # two columns: number of captures (a factor) and frequency
X[,1] = as.numeric(as.character(X[,1]))  # convert the factor of capture counts to numbers
desc=descriptive(X, dfreq=TRUE, dtype="nbcap", t=300)
desc # useful to see, not discussed
plot(desc) # useful to see, not discussed
cp=closedp.0(X, dfreq=TRUE, dtype="nbcap", t=300, trace=TRUE)
cp

The result:

Number of captured units: 265 

Abundance estimations and model fits:
                  abundance       stderr      deviance   df           AIC
M0**                  265.0          0.0  2.297787e+39  298  2.297787e+39
Mh Chao              1262.7        232.5  7.840000e-01    9  5.984840e+02
Mh Poisson2**         265.0          0.0  2.977883e+38  297  2.977883e+38
Mh Darroch**          553.9         37.1  7.299900e+01  297  9.469900e+01
Mh Gamma3.5**  5644623606.6  375581044.0  5.821861e+05  297  5.822078e+05

 ** : The M0 model did not converge
 ** : The Mh Poisson2 model did not converge
 ** : The Mh Darroch model did not converge
 ** : The Mh Gamma3.5 model did not converge
Note: 9 eta parameters has been set to zero in the Mh Chao model

Thus only the Mh Chao model converged; it estimated $\hat{N} = 1262.7$.
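As a rough cross-check (this is not how Rcapture computes its loglinear fit), Chao's classical lower-bound estimator built from the singleton and doubleton counts, $\hat{N} \geq u + f_1^2/(2 f_2)$, lands in the same ballpark:

265 + 234^2 / (2 * 27)  # about 1279, versus the Mh Chao fit of 1262.7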


EDIT: To check the reliability of this method, I ran the code above on 10000 simulated samples. The Mh Chao model converged every time. Here is the summary:
> round(quantile(Nhat, c(0, 0.025, 0.25, 0.50, 0.75, 0.975, 1)), 1)
    0%   2.5%    25%    50%    75%  97.5%   100% 
 657.2  794.6  941.1 1034.0 1144.8 1445.2 2162.0 
> mean(Nhat)
[1] 1055.855
> sd(Nhat)
[1] 166.8352
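For reference, the replication loop could look something like the following sketch (not the original code; the exact row and column names in closedp.0's results table may vary between Rcapture versions, hence the grep on "Chao"):

library(Rcapture)
N = 1000; n = 300
Nhat = replicate(10000, {
  observation = sample(1:N, size = n, replace = TRUE)
  X = data.frame(table(table(observation)))       # capture-count frequencies, as above
  X[,1] = as.numeric(as.character(X[,1]))
  cp = closedp.0(X, dfreq = TRUE, dtype = "nbcap", t = n)
  cp$results[grep("Chao", rownames(cp$results)), "abundance"]
})
round(quantile(Nhat, c(0, 0.025, 0.25, 0.50, 0.75, 0.975, 1)), 1)
mean(Nhat); sd(Nhat)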
GaBorgulya
  • 3,253
  • 15
  • 19
  • It seems some justification for the use of capture-recapture models is needed, because this is not a standard capture-recapture experiment. (Possibly it can be viewed as 300 capture events, but the call to closedp does not seem to indicate that.) – whuber Apr 06 '11 at 03:23
  • @whuber Yes, I did view the example as 300 capture events. How do you mean that "the call to closedp does not seem to indicate that"? I appreciate (constructive) criticism and I'm happy to correct (or delete if necessary) my answer if it turns out to be wrong. – GaBorgulya Apr 06 '11 at 09:45
  • this seems a reasonable approach. However I won't be using R so need to understand the maths behind it. The wiki page covers a 2 event situation - how do I apply it to this case? – hoju Apr 06 '11 at 10:57
  • 1
    @Ga I see: You created a 300 x 300 matrix for the data! The inefficiency of this code fooled me: it would be simpler and more direct to use `closedp.0(Y, dfreq=TRUE, dtype="nbcap", t=300)` where Y is the frequency matrix {{1,234}, {2,27}, {3,4}} (which you computed twice and actually displayed!). More to the point, the convergence failures are alarming, suggesting there are problems with the underlying code or models. (An exhaustive search of the [docs](http://cran.r-project.org/web/packages/Rcapture/Rcapture.pdf) for "M0" turns up no references or description for this method...) – whuber Apr 06 '11 at 15:32
  • @whuber The matrix is 265x300; there were 265 unique values observed in this example. – GaBorgulya Apr 06 '11 at 18:48
  • @whuber It was the Mh Chao model that converged, the classical one. I haven't read the original paper, but it may be this one: http://www.jstor.org/pss/2531532 . – GaBorgulya Apr 06 '11 at 19:02
  • 1
    @whuber I simplified the code following your suggestion (dfreq=TRUE, dtype="nbcap", t=300). Thanks again. – GaBorgulya Apr 06 '11 at 19:18
  • @whuber Do you still find the convergence failures alarming? The Mh Chao model converged in 10000 of 10000 simulations, see the edit. – GaBorgulya Apr 08 '11 at 04:20