
The probability of drawing $k$ white balls in a sample of size $n$ taken from an urn of $N$ balls, $K$ of which are white, is: $$ P(k \mid n,N,K) = \frac{{{K}\choose{k}}{{N-K}\choose{n-k}}}{{{N}\choose{n}}} $$ How can one infer this probability when $K$ is not determined a priori?

In other words, the question is: what is the probability of drawing $k$ white balls in a sample of size $n$ taken from an urn of $N$ balls when the number of white balls is unknown? (I only know for sure that at least $k$ of them are white, but there may be more.)
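For concreteness, the known-$K$ pmf above can be evaluated directly; here is a small sanity check using only the standard library (the urn numbers are made up for illustration):

```python
from math import comb

# Toy urn: N balls total, K white, sample of size n (numbers made up).
N, K, n = 50, 18, 10

def pmf(k):
    """Hypergeometric P(k | n, N, K)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# The probabilities over all feasible k should sum to one.
total = sum(pmf(k) for k in range(max(0, n - (N - K)), min(n, K) + 1))
print(round(total, 10))  # 1.0
```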

  • Similar binomial question: https://stats.stackexchange.com/questions/123367/estimating-parameters-for-a-binomial/123748#123748 – kjetil b halvorsen Nov 01 '17 at 20:50
  • See https://stats.stackexchange.com/questions/137331/estimating-size-of-a-set-based-on-two-overlapping-subsets – Glen_b Nov 30 '17 at 06:48

1 Answer


You can estimate $K$ using the method of moments estimator

$$ \frac{k}{n} \approx \frac{K}{N} \implies \hat K = \frac{N}{n}\, k $$

or the maximum likelihood estimator, as described by Zhang (2009):

$$ \hat K = \left\lfloor \frac{N+1}{n}\, k \right\rfloor $$

For the derivation and further details, see the following paper:

Zhang, H. (2009). A note about maximum likelihood estimator in hypergeometric distribution. Comunicaciones en Estadística, 2(2), 169-174.
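A quick sketch comparing the two point estimates on a toy urn (all numbers are made up; the MLE is found by brute force over the feasible range of $K$, so it does not presuppose the closed-form expression):

```python
from math import comb

# Toy urn: N balls total, sample of size n with k white balls observed.
N, n, k = 100, 20, 7

# Method-of-moments estimate: solve k/n = K/N for K.
K_mom = N * k / n

# Maximum-likelihood estimate: maximize the hypergeometric likelihood
# over all feasible K (brute force, for illustration).
def likelihood(K):
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

K_mle = max(range(k, N - n + k + 1), key=likelihood)

print(K_mom)  # 35.0
print(K_mle)  # 35
```

Here the two estimates coincide; in general they can differ by one because the MLE is an integer.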

On the other hand, if you want to define a distribution for the number of white balls $k$ drawn without replacement from an urn containing $N$ balls in total, while treating the total number of white balls $K$ as unknown, i.e. as a random variable, then you can phrase the problem as a Bayesian model with a beta-binomial prior (in fact, a conjugate prior) for $K$ (as described by Fink, 1997, and Dyer and Pierce, 1993):

$$ k \sim \mathcal{H}(N,K,n) \\ K \sim \mathcal{BB}(N, \alpha, \beta) $$

which leads to a beta-binomial posterior predictive distribution of $k$ parametrized by $N$, $\alpha' = \alpha + k$ and $\beta' = \beta + n - k$, and the posterior distribution of $K$ is

$$ f(K\mid k,N,\alpha,\beta) = {N-n \choose K-k} \frac{\Gamma(\alpha+K)\,\Gamma(\beta+N-K)\,\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+k)\,\Gamma(\beta+n-k)\,\Gamma(\alpha+\beta+N)} $$

If you want to assume that $K$ can be anything in the $[k,\, N-n+k]$ range, you can use the uniform prior $\alpha=\beta=1$. For further details, see:

Dyer, D. and Pierce, R.L. (1993). On the Choice of the Prior Distribution in Hypergeometric Sampling. Communications in Statistics - Theory and Methods, 22(8), 2125-2146.

Fink, D. (1997). A Compendium of Conjugate Priors. Technical report.
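The posterior density of $K$ can be checked numerically. A sketch on toy numbers with the uniform $\alpha=\beta=1$ prior (under which the prior on $K$ is flat, so the posterior is just the renormalized hypergeometric likelihood), implemented with log-gammas for numerical stability:

```python
from math import comb, lgamma, exp, log

# Toy urn (numbers made up): N balls, sample of n with k white observed.
N, n, k = 20, 5, 2
alpha, beta = 1.0, 1.0   # uniform beta-binomial prior on K

def posterior(K):
    """Closed-form f(K | k, N, alpha, beta) on the support [k, N - n + k]."""
    if not k <= K <= N - n + k:
        return 0.0
    return exp(
        log(comb(N - n, K - k))
        + lgamma(alpha + K) + lgamma(beta + N - K) + lgamma(alpha + beta + n)
        - lgamma(alpha + k) - lgamma(beta + n - k) - lgamma(alpha + beta + N)
    )

# Brute-force check: with a flat prior the posterior is the
# normalized hypergeometric likelihood of K.
def likelihood(K):
    if K < k or N - K < n - k:
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

Z = sum(likelihood(K) for K in range(N + 1))
assert abs(sum(posterior(K) for K in range(N + 1)) - 1.0) < 1e-9
assert all(abs(posterior(K) - likelihood(K) / Z) < 1e-9 for K in range(N + 1))

print(max(range(N + 1), key=posterior))  # posterior mode: 8
```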

You may also be interested in reading about the capture-recapture method, where the goal is to estimate $N$; it is closely related and follows the same logic.

Tim
  • How did you compute here the predictive posterior of $k$ (the beta-binomial parametrized by $N$, $\alpha'$, and $\beta'$)? I couldn't find any info about this in the references. – pms Aug 12 '19 at 02:50
  • @pms sorry, forgot to add the link, now updated – Tim Aug 12 '19 at 10:39
  • Actually, I found and checked previously Fink's compendium, but I couldn't find there anything about the predictive posterior. In this case, the predictive posterior is a beta-binomial mixture of hypergeometric distributions, but you wrote here that it's the beta-binomial with parameters $N$, $\alpha'$, and $\beta'$, so I'm wondering how you figured this out; is this conclusion straightforward? – pms Aug 13 '19 at 21:44
  • @pms have you checked the paper by Dyer and Pierce? – Tim Aug 14 '19 at 05:58
  • Indeed, in Dyer and Pierce there is an expression for the marginal distribution of $k$ (their $m(x)$ at the top of page 2131), but it's not explained how it was obtained. I imagine it can be calculated by marginalizing the hypergeometric likelihood over the beta-binomial prior, but this calculation looks a bit tedious at first sight. Btw. a small correction -- it should be $\beta'=\beta+n-k$ in your above answer. Everything else looks fine. – pms Aug 16 '19 at 21:50
  • I know this post is getting old, but for the posterior distribution of $K$, shouldn't we be using $\alpha '$ and $\beta '$? If we use $\alpha$ and $\beta$ then the posterior will have a uniform shape. – Travis L Jul 01 '20 at 16:36
  • @silent_spec but this is what the answer says, $\alpha$ and $\beta$ are parameters of the prior. – Tim Jul 01 '20 at 17:27
  • @Tim what I meant is if we leave $\alpha$ and $\beta$ as the parameters of a uniform prior ($\alpha = 1$ and $\beta = 1$) then the posterior will also be uniform between $k$ and $N-n+k$, which seems undesirable as the shape of the posterior is not being determined by the likelihood of the data. – Travis L Jul 13 '20 at 17:57