
Can you estimate $N$ with MLE, the method of moments, or some other strategy?

  1. $N$ numbered balls are in a bag. $N$ is unknown.
  2. Pick a ball uniformly at random, record its number, replace it, shuffle.
  3. After $M$ samples, of which we noticed $R$ repeated numbers, how can we estimate the value of $N$?
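A quick simulation of this process (with made-up values $N = 100$ and $M = 50$, and taking $R$ to be the number of draws that repeat an earlier number) shows what is actually observed:

```r
# Simulate the described process: draw M balls uniformly with replacement
# from a bag of N numbered balls (N = 100 and M = 50 are made-up values).
set.seed(42)
N <- 100; M <- 50
draws <- sample.int(N, M, replace = TRUE)  # recorded numbers
K <- length(unique(draws))                 # distinct numbers seen
R <- M - K                                 # draws repeating an earlier number
c(K = K, R = R)
```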

Attempt:

If there are $n$ elements in the set, then the probability that exactly $x$ distinct elements have been selected after a sample of size $m$ (with replacement) is

$$\frac{S_2(x,m) \; n!}{n^m \; (n-x)!} $$

And I got stuck. Any idea?

YoYO Man
  • Is $R$ the number of distinct balls you have seen at least once (as suggested in your attempt) or the number you have seen more than once (as in *repeated*)? – Henry Nov 17 '20 at 00:48
  • Which one is easier to come up with a solution? @Henry, I think $R$ is the number of distinct balls. – YoYO Man Nov 17 '20 at 00:49
  • Seems similar to the math used to estimate the total number of German tanks in WW2 based on serial numbers found on captured/destroyed units. That was "sampling without replacement" – Carl Witthoft Nov 17 '20 at 15:43

3 Answers


This is a standard statistical inference problem involving the classical occupancy distribution (see e.g., O'Neill 2019). Since $R$ is the number of repeated balls, the number of distinct balls selected in the sample is given by:

$$K = N-R \ \sim \ \text{Occ}(N, M).$$

The probability mass function for this random variable is:

$$p(K=k|N,M) = \frac{(N)_k \cdot S(M,k)}{N^M} \cdot \mathbb{I}(1 \leqslant k \leqslant \min(M,N)),$$

where the values $S(M,k)$ are the Stirling numbers of the second kind and $(N)_k$ are the falling factorials. The classical occupancy distribution has been subject to a great deal of analysis in the statistical literature, including analysis of statistical inference for the size parameter $N$ (see e.g., Harris 1968). The form of this distribution and its moments is known, so deriving the MLE or MOM estimators is a relatively simple task.
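As a sanity check of this form, the pmf can be evaluated directly from the standard recurrence $S(m,j) = j\,S(m-1,j) + S(m-1,j-1)$ for the Stirling numbers of the second kind (a plain-arithmetic sketch; for large $M$ this should be done in log space):

```r
# Evaluate the occupancy pmf p(K = k | N, M) using the Stirling-number
# recurrence S(m, j) = j*S(m-1, j) + S(m-1, j-1).  Plain arithmetic here;
# for large M work in log space to avoid overflow.
occ.pmf <- function(k, N, M) {
  S <- c(1, rep(0, M))                  # row m = 0: S(0, 0) = 1
  for (m in 1:M) {
    S.new <- rep(0, M + 1)
    for (j in 1:m) { S.new[j + 1] <- j * S[j + 1] + S[j] }
    S <- S.new
  }
  prod(N - 0:(k - 1)) * S[k + 1] / N^M  # (N)_k * S(M, k) / N^M
}

sum(sapply(1:4, occ.pmf, N = 6, M = 4))  # sums to 1 over the support
```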


Maximum-likelihood estimator (MLE): Since the size parameter is an integer, we can find the MLE using discrete calculus. For any value $1 \leqslant k \leqslant \min(M,N)$ the forward difference of the probability mass function with respect to $N$ can be written as:

$$\begin{align} \Delta_N p(k) &\equiv p(K=k|N+1,M) - p(K=k|N,M) \\[10pt] &= \frac{(N+1)_k \cdot S(M,k)}{(N+1)^M} - \frac{(N)_k \cdot S(M,k)}{N^M} \\[6pt] &= S(M,k) \bigg[ \frac{(N+1)_k}{(N+1)^M} - \frac{(N)_k}{N^M} \bigg] \\[6pt] &= S(M,k) \cdot \frac{(N)_{k}}{(N+1)^M} \bigg[ \frac{N+1}{N-k+1} - \Big( \frac{N+1}{N} \Big)^M \ \bigg] \\[6pt] \end{align}$$

Thus, if we observe $K=k$ then the MLE is the first value of $N$ at which the likelihood starts to decrease:

$$\hat{N}_\text{MLE} = \min \bigg \{ N \in \mathbb{N} \ \bigg| \ \frac{N+1}{N-k+1} < \Big( \frac{N+1}{N} \Big)^M \bigg \}.$$

(There may be cases where the MLE is not unique: if the forward difference is exactly zero at this point then two consecutive values of $N$ attain the maximum, which corresponds to using $\leqslant$ instead of $<$ in the inequality.) Here is a simple function in R to compute the MLE and an example where the input values are fairly large.

MLE.Occ.n <- function(m, k) {
  # Start at the smallest feasible size N = k and increase N while the
  # likelihood is still non-decreasing (forward difference >= 0)
  n <- k
  while ((n + 1)/(n - k + 1) >= (1 + 1/n)^m) { n <- n + 1 }
  n
}

MLE.Occ.n(m = 1000, k = 649)
[1] 1066

Estimation using method-of-moments: The first four moments of the classical occupancy distribution are given in O'Neill (2019) (Section 2). The expected number of different balls is:

$$\mathbb{E}(K) = N \Bigg[ 1 - \Big( 1-\frac{1}{N} \Big)^M \Bigg].$$
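As a quick sanity check of this formula (with made-up values $N = 20$, $M = 30$), the simulated mean number of distinct balls matches the theoretical value closely:

```r
# Monte Carlo check of E(K) = N*(1 - (1 - 1/N)^M) for made-up N and M.
set.seed(1)
N <- 20; M <- 30
sims <- replicate(1e4, length(unique(sample.int(N, M, replace = TRUE))))
mean(sims)               # simulated mean of K
N * (1 - (1 - 1/N)^M)    # theoretical value, approx 15.71
```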

Thus, if we observe $K=k$ then the method-of-moments estimator will approximately solve the implicit equation:

$$\log \hat{N}_\text{MOM}^* - \log k + \text{log1mexp} \Bigg[ - M \log \Big( 1-\frac{1}{\hat{N}_\text{MOM}^*} \Big) \Bigg] = 0.$$

You can solve this equation numerically to obtain a real value $\hat{N}_\text{MOM}^*$ and then take one of the two surrounding integers as $\hat{N}_\text{MOM}$ (these give slight under- and over-estimates of the true expected value, and you can pick between them by some appropriate method, e.g., rounding to the nearest integer). Here is a function in R (using VGAM::log1mexp for numerical stability) to compute the method-of-moments estimator. As can be seen, it gives the same result as the MLE in the present example.

MOM.Occ.n <- function(m, k) {
  # Implicit MOM equation in log space; log1mexp(a) computes log(1 - exp(-a))
  FF     <- function(n) { log(n) - log(k) + VGAM::log1mexp(-m*log(1 - 1/n)) }
  UPPER  <- m*k/(m - k)            # upper bracket for the root search
  n.real <- uniroot(f = FF, lower = k, upper = UPPER)$root
  round(n.real, 0)                 # round to the nearest integer
}

MOM.Occ.n(m = 1000, k = 649)
[1] 1066
Ben
  • I am not quite sure what your $k$ is supposed to represent. Is it the number of distinct balls drawn, or the number of balls not drawn? In your example with $M=1000$ and $k=649$ your suggestion of $\hat N = 1066$ looks like a sensible value for the total number of balls if $k$ is the number of distinct balls drawn (my approximation using the Lambert W function suggests about $1066.86$). But your method of moments approach suggests $\hat N\approx 1356.68$ which only makes sense if $k$ is the number not drawn; but that is unobservable – Henry Nov 17 '20 at 09:08
  • If you had written $\log ( \hat{N}_\text{MOM}-R) \approx M \log (\hat{N}_\text{MOM} - 1) - (M-1) \log \hat{N}_\text{MOM}$ and used $M=1000$ and $R=649$ then the solution would be $\hat{N}_\text{MOM} \approx 1066.1$, close to the MLE estimate and my approximation – Henry Nov 17 '20 at 09:09
  • Yes, the random variable $K=N-R$ is the number of distinct balls drawn (called the "occupancy" in occupancy problems). – Ben Nov 17 '20 at 21:35
  • If $K$ is the number of distinct balls drawn then $\mathbb{E}(K) = N-N \Big( 1-\frac{1}{N} \Big)^M$ which is not what you have written – Henry Nov 17 '20 at 21:38
  • @Henry: I have corrected the MOM section to use the correct formula and give a function to compute. – Ben Nov 17 '20 at 22:23
  • OK - now you have reached my answer – Henry Nov 18 '20 at 01:05

I think your likelihood expression has reversed $x=R$ and $m=M$ in $S_2(x,m)$, but no matter: this term is constant with respect to $N$ and so can be ignored. What you want is the integer $N$ which maximises $\frac{N!}{N^M \; (N-R)!}$. So you want the largest $N$ where $\frac{N!}{N^M \; (N-R)!} \ge \frac{(N-1)!}{(N-1)^M \; (N-1-R)!}$, i.e. where $N\left(\frac{N-1}{N}\right)^M\ge N-R$, though I doubt this has a simple closed form for $N$.
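That largest $N$ is easy to find by direct search; a minimal sketch (the helper name `N.mle` is made up, and the example values $M = 100$, $R = 50$ are the ones used in the illustration further down):

```r
# Largest N satisfying N*((N-1)/N)^M >= N - R, found by stepping upward.
N.mle <- function(M, R) {
  n <- R  # N = R trivially satisfies the inequality (right side is 0)
  while ((n + 1) * (n/(n + 1))^M >= n + 1 - R) { n <- n + 1 }
  n
}

N.mle(M = 100, R = 50)
[1] 62
```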

Another possible approach, using a method of moments, is to consider a particular ball: the probability that it is never selected is $\left(\frac{N-1}{N}\right)^M$, so the expected number of balls never selected is $N\left(\frac{N-1}{N}\right)^M$ and the expected number selected at least once is $N - N\left(\frac{N-1}{N}\right)^M$. If you see $R$ distinct balls from $M$ attempts then you could try to solve $R = N - N\left(\frac{N-1}{N}\right)^M$ for $N$. This is essentially the same equation as in the likelihood approach, though without the rounding down.

Solving this would not be easy, but in some cases you could use the approximation $\left(\frac{N-1}{N}\right)^M \approx e^{-M/N}$ in which case you might consider $$\hat N\approx \dfrac{M}{\frac{M}{R}+ W\left(-\frac MRe^{-M/R}\right)}$$ where $W$ is the Lambert W function. (When $M \gg R$ the denominator is almost $\frac MR$ so $\hat N$ is very slightly more than $R$, as one might expect.)
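A sketch of this approximation in R, with the principal branch $W_0$ computed by a small Newton iteration so that no extra package is needed (the helper names are made up):

```r
# Principal-branch Lambert W via Newton iteration (valid for x in (-1/e, 0]).
lambertW0 <- function(x, tol = 1e-12) {
  w <- 0  # starting point that converges to the principal branch here
  repeat {
    w.new <- w - (w * exp(w) - x)/(exp(w) * (w + 1))
    if (abs(w.new - w) < tol) { return(w.new) }
    w <- w.new
  }
}

# The approximation N-hat = M / (M/R + W(-(M/R) * exp(-M/R)))
N.approx <- function(M, R) { M/(M/R + lambertW0(-(M/R) * exp(-M/R))) }

N.approx(M = 100, R = 50)  # approx 62.75
```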

As an illustration, if $M=100$ and $R=50$ then direct calculation would eventually give you $\hat N \approx 62.41$ while the suggested approximation could give you $\hat N\approx 62.75$. The likelihood approach would say $\hat N \le 62.41$ so round this down to $\hat N =62$.

Henry
  • And now I have found a related question nine years ago, where I gave two answers: https://stats.stackexchange.com/questions/9240/estimating-population-size-from-the-frequency-of-sampled-duplicates-and-uniques – Henry Nov 17 '20 at 02:09

I think you would need another constraint. As described, it would only be possible to estimate a lower bound on $N$: there could be any number of balls.

I think you also need to specify that each ball in the bag has a unique number.