Estimate the size of a population being sampled by the number of repeat observations

Question

Say I have a population of 50 million unique things, and I take 10 million samples (with replacement)... The first graph I've attached shows how many times I sample the same "thing", which is relatively rare as the population is larger than my sample.

However, if my population is only 10 million things and I take 10 million samples, then, as the second graph shows, I will more often sample the same thing repeatedly.

My question is: from my frequency table of observations (the data in the bar charts), is it possible to estimate the original population size when it is unknown? It would also be great if you could provide a pointer on how to go about this in R.

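[Figure: two bar charts of how many times the same thing was sampled, the first for a population of 50 million and the second for a population of 10 million, each with 10 million samples]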
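For anyone who wants to reproduce this kind of frequency table, here is a minimal sketch in R. The population size, number of draws and seed are scaled-down, made-up values chosen only so it runs quickly; they are not the figures from the question.

```r
# Hypothetical, scaled-down version of the setup described in the question.
set.seed(1)     # arbitrary seed, for reproducibility only
N <- 50000      # population size (unknown in a real application)
n <- 10000      # number of draws, with replacement

draws <- sample.int(N, size = n, replace = TRUE)

# times_seen[i] = number of times item i was drawn (0 means never drawn)
times_seen <- tabulate(draws, nbins = N)

# Frequency table as in the bar charts: items seen once, twice, ...
# The zero class is dropped because unseen items cannot be counted in practice.
freq <- table(times_seen[times_seen > 0])
print(freq)
```

Setting N closer to n should shift more of the mass towards items seen two or more times, which is the difference between the two charts.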

See https://space.stackexchange.com/questions/41547/how-do-we-know-what-percentage-of-neos-weve-discovered for an interesting application — kjetil b halvorsen, Feb 20 '20 at 19:28

score 10 · Accepted Answer · edited Sep 08 '10 at 13:35

How's the Garvan?

The problem is that we don't know how many items have a zero count, that is, how many members of the population were never sampled. We have to estimate this. A classic statistical procedure for situations like this is the Expectation-Maximisation (EM) algorithm.

A simple example:

Assume we draw from an unknown population (here, 1,000,000) with a Poisson rate of 0.2.

counts <- rpois(1000000, 0.2)
table(counts)

     0      1      2      3      4      5
818501 164042  16281   1111     62      3

But we don't observe the zero counts. Instead we observe this:

table <- c("0"=0, table(counts)[2:6])

table

     0      1      2      3      4      5
     0 164042  16281   1111     62      3

Possible frequencies observed

k <- c("0"=0, "1"=1, "2"=2, "3"=3, "4"=4, "5"=5)

Initialise the mean of the Poisson distribution with a guess (we know it's 0.2 here).

lambda <- 1

Expectation - Poisson Distribution

P_k <- lambda^k*exp(-lambda)/factorial(k)
P_k

          0           1           2           3           4           5
0.367879441 0.367879441 0.183939721 0.061313240 0.015328310 0.003065662

n0 <- sum(table[2:6])/(1 - P_k[1]) - sum(table[2:6])
n0

       0
105628.2

table[1] <- n0

Maximisation

lambda_MLE <- (1/sum(table))*(sum(table*k))        
lambda_MLE        
[1] 0.697252        
lambda <- lambda_MLE

Second iteration

P_k <- lambda^k*exp(-lambda)/factorial(k)
n0 <- sum(table[2:6])/(1 - P_k[1]) - sum(table[2:6])
table[1] <- n0
lambda <- (1/sum(table))*(sum(table*k))
cbind(population = sum(table), lambda_MLE = lambda)

     population lambda_MLE
[1,]   361517.1  0.5537774

Now iterate until convergence:

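for (i in 1:200) {
  # Expectation: estimate the number of unobserved (zero-count) items
  P_k <- lambda^k*exp(-lambda)/factorial(k)
  n0 <- sum(table[2:6])/(1 - P_k[1]) - sum(table[2:6])
  table[1] <- n0
  # Maximisation: update the Poisson mean using the completed table
  lambda <- (1/sum(table))*(sum(table*k))
}
# Report the current lambda, not the stale lambda_MLE from the first iteration
cbind(population = sum(table), lambda_MLE = lambda)

     population lambda_MLE
[1,]    1003774  0.1994473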

Our population estimate is 1003774 and our Poisson rate is estimated at 0.1994473. The rate is the estimated mean number of times each member of the population is sampled; the estimated proportion of the population sampled at least once is 1 - exp(-0.1994473), or about 0.18. The main problem you will have in the typical biological problems you are dealing with is the assumption that the Poisson rate is constant.
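The iteration above can also be wrapped in a function that stops once lambda settles, rather than running a fixed 200 passes. This is only a sketch under the same constant-rate Poisson assumption; the function name, argument names and tolerance are hypothetical choices, not part of any package.

```r
# Sketch: EM for zero-truncated Poisson counts (all names here are hypothetical).
# freq: named vector; names are the number of times seen (>= 1),
#       values are how many distinct items were seen that often.
estimate_population_em <- function(freq, lambda = 1, tol = 1e-10, max_iter = 10000) {
  k <- as.numeric(names(freq))
  n_seen  <- sum(freq)        # distinct items observed at least once
  n_draws <- sum(k * freq)    # total number of observations
  for (i in seq_len(max_iter)) {
    # E-step: expected number of never-seen items under the current lambda
    n0 <- n_seen / (1 - exp(-lambda)) - n_seen
    # M-step: Poisson mean over seen plus imputed unseen items
    lambda_new <- n_draws / (n_seen + n0)
    converged <- abs(lambda_new - lambda) < tol
    lambda <- lambda_new
    if (converged) break
  }
  list(population = n_seen / (1 - exp(-lambda)), lambda = lambda, iterations = i)
}

# Observed counts from the example above (zero class excluded)
freq <- c("1" = 164042, "2" = 16281, "3" = 1111, "4" = 62, "5" = 3)
fit <- estimate_population_em(freq)
fit$population
fit$lambda
```

On these counts the result should land close to the figures reported above, since it is the same update applied until the change in lambda drops below the tolerance.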

Sorry for the long-winded post - this wiki is not really suitable for R code.

Highlight your code and click on the button that looks like binary numbers... — Shane, Sep 08 '10 at 11:52

onestop · Answer 2 · 2010-09-08T11:29:12.837

This sounds like a form of 'mark and recapture' aka 'capture-recapture', a well-known technique in ecology (and some other fields such as epidemiology). Not my area but the Wikipedia article on mark and recapture looks reasonable, though your situation is not the one to which the Lincoln–Petersen method explained there applies.

I think shabbychef is on the right track for your situation, but using the Poisson distribution to approximate the binomial would probably make things a bit simpler and should be a very good approximation if the population size is very large, as in your examples. I think getting an explicit expression for the maximum likelihood estimate of the population size should then be pretty straightforward (see e.g. Wikipedia again), though I don't have time to work out the details right now.
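As a rough sketch of where the Poisson approximation leads: if each item's count is Poisson with mean lambda, the counts of the items seen at least once follow a zero-truncated Poisson, whose mean is lambda / (1 - exp(-lambda)). Setting that equal to the observed mean count and solving numerically gives lambda, and the population estimate is then the number of draws divided by lambda. The function name below is hypothetical, and the root is found with `uniroot` rather than written in closed form.

```r
# Sketch under the Poisson approximation; function and argument names are hypothetical.
estimate_population_ztp <- function(n_draws, n_distinct) {
  stopifnot(n_draws > n_distinct)   # needs at least one repeat observation
  xbar <- n_draws / n_distinct      # mean count among items seen at least once
  # Zero-truncated Poisson MLE: solve lambda / (1 - exp(-lambda)) = xbar
  score <- function(lambda) lambda / (-expm1(-lambda)) - xbar
  lambda_hat <- uniroot(score, interval = c(1e-8, xbar), tol = 1e-12)$root
  c(population = n_draws / lambda_hat, lambda = lambda_hat)
}

# Counts from the accepted answer: 200200 observations over 181499 distinct items
est <- estimate_population_ztp(n_draws = 200200, n_distinct = 181499)
est
```

This solves the same equation that the EM iteration in the accepted answer converges to, so the two should agree closely on the same data.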

shabbychef · Answer 3 · 2010-09-08T16:55:59.350

You can estimate via a binomial distribution. If there are $n$ draws, with replacement, from $k$ objects (with $k$ unknown), the probability of a given object being drawn in a single draw is $P = \frac{1}{k}$. Think of this as a coin flip now. The probability of exactly $m$ heads (i.e. that object being drawn exactly $m$ times) from $n$ trials is ${n \choose m} P^m (1-P)^{n-m}$. Multiply this by $k$, the number of objects, to get the expected number of objects observed $m$ times (your plot). For large $n$ it can be a little hairy to back out $k$ from the data, but for small $m$, you can probably do fine assuming the $(1-P)^{n-m}$ term is equal to $1$, provided $n$ is much smaller than $k$.

edit: one possible way to fix the numerical problems is to look at the ratios of counts. That is, if $P_m$ is the probability of drawing $m$ heads, then $P_{m} / P_{m+1}$ is equal to $(k-1)\frac{m+1}{n-m}$. Then look at the ratios of counts of duplicates in your data to get multiple estimates of $k$, then take the median or mean.
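A minimal sketch of the ratio idea in R, assuming `freq` is a table of how many objects were seen once, twice, and so on, and `n` is the total number of draws. Rearranging the ratio gives $k \approx 1 + \frac{c_m}{c_{m+1}} \cdot \frac{n-m}{m+1}$, where $c_m$ is the number of objects seen $m$ times. The function name and the simulation settings are hypothetical.

```r
# Sketch of the ratio estimator; function name and simulation settings are hypothetical.
estimate_population_ratio <- function(freq, n) {
  m <- as.numeric(names(freq))
  ests <- numeric(0)
  for (i in seq_along(m)) {
    j <- match(m[i] + 1, m)            # position of the class "seen m + 1 times"
    if (!is.na(j)) {
      ratio <- freq[[i]] / freq[[j]]   # empirical stand-in for P_m / P_{m+1}
      ests <- c(ests, 1 + ratio * (n - m[i]) / (m[i] + 1))
    }
  }
  median(ests)                         # median damps the noisy high-m ratios
}

# Simulated check with a known population size
set.seed(1)
k_true <- 100000
n <- 100000
times_seen <- tabulate(sample.int(k_true, n, replace = TRUE), nbins = k_true)
freq <- table(times_seen[times_seen > 0])
estimate_population_ratio(freq, n)
```

Ratios built from the rarest classes rest on very small counts, so they can be far off individually; that is the reason for taking the median here rather than the mean.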
