
I've got a set of differentially expressed biomarkers, and I want to check the significance of this observation.

For a similar problem, I've seen the hypergeometric test being used, where

  • $k$ = number of detected differentially expressed biomarkers
  • $K$ = total number of known differentially expressed biomarkers
  • $n$ = sample size
  • $N$ = total population size

to compute the p-value of seeing $\geq k$ biomarkers.
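For concreteness, a minimal sketch of this computation in R (the numbers below are placeholders, not my actual data):

```r
# Hypergeometric upper-tail p-value: P(X >= k) when drawing a sample of size n
# without replacement from a population of N objects, K of which are "successes".
# Placeholder numbers only -- not my real data.
k <- 12      # detected differentially expressed biomarkers
K <- 1000    # known differentially expressed biomarkers (approximate)
n <- 1e6     # sample size
N <- 1e9     # approximate population size

# Note: phyper() names its own arguments (q, m, n, k), which clashes with the
# symbols above, so everything is passed by name.
phyper(q = k - 1, m = K, n = N - K, k = n, lower.tail = FALSE)
```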

The tricky thing here is:

  • the event is very rare, i.e., $N \gg K$ (specifically, $\frac{K}{N} < 10^{-6}$)
  • the true value of $K$ is unknown; I've got an approximate number, but the actual value of $K$ is likely to be larger. I've seen this post, but I'm not sure it's applicable to my dataset given the rarity of seeing a "Type I" object
  • [EDIT] the typical size of my sample $n$ is around $10^6$, and the sampling is done without replacement. Side note: the true value of $N$ is not known either, but it is typically approximated as $N \geq 10^9$

To compute the p-value of seeing $\geq k$ biomarkers for my dataset, does it still make sense to use a hypergeometric test?

I was wondering whether a Poisson exact test makes more sense, where the null hypothesis assumes that the rate is equal to $K/N$, against the alternative that it equals the $k/n$ observed in my sample?
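Something along these lines with base R's poisson.test, treating the sample size $n$ as the exposure and $K/N$ as the null rate (again, placeholder numbers):

```r
# Poisson exact test: observed count k over "exposure" n, with null rate K/N.
# Placeholder numbers only -- not my real data.
k <- 12; K <- 1000; n <- 1e6; N <- 1e9
poisson.test(x = k, T = n, r = K / N, alternative = "greater")
```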

1 Answer


As $N \rightarrow \infty$ the hypergeometric distribution converges to a binomial distribution (with size parameter $n$ and probability $K/N$), so that distribution would be a natural approximation in the case where $N$ is large. Since $K$ is unknown, one reasonable approach would be to give the probability parameter a prior distribution and proceed from there. The conjugate Bayesian approach would be to give the probability parameter a beta prior, leading to a beta-binomial distribution for the observable value $k$. If you were to use this approach then your distributional approximation would be:

$$p(k|n) = \text{BetaBin}(k|n,\alpha,\beta) = {n \choose k} \frac{\text{B}(k+\alpha,n-k+\beta)}{\text{B}(\alpha,\beta)},$$

where $\alpha>0$ and $\beta>0$ are hyperparameters. (One simple case is to use a uniform prior with $\alpha=\beta=1$.) Based on your updated information, which specifies that $n$ is also large, you could take the Poisson approximation to the binomial if you wish, and this would lead to a different approximating distribution (e.g., Poisson-gamma). In any case, you can compute probabilities from the beta-binomial distribution in R using the pbetabinom function in the rmutil package.
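For concreteness, here is a minimal sketch of the upper-tail probability $P(X \geq k)$ under the beta-binomial, computed directly on the log scale in base R (the values of $k$, $n$, $\alpha$ and $\beta$ below are placeholders; the pbetabinom function mentioned above would give the corresponding lower-tail probability under its own parameterisation), together with the negative binomial form of the Poisson-gamma alternative:

```r
# Upper-tail probability P(X >= k) for a beta-binomial(n, alpha, beta),
# evaluated on the log scale for numerical stability. Placeholder values only.
pbb_upper <- function(k, n, alpha, beta) {
  j <- k:n
  log_terms <- lchoose(n, j) + lbeta(j + alpha, n - j + beta) - lbeta(alpha, beta)
  sum(exp(log_terms))
}

k <- 12; n <- 1e6
pbb_upper(k, n, alpha = 1, beta = 1)   # uniform beta prior on the probability

# Poisson-gamma alternative: if lambda ~ Gamma(shape = a, rate = b) and
# X | lambda ~ Poisson(n * lambda), then marginally
# X ~ Negative-Binomial(size = a, prob = b / (b + n)), so the upper tail is:
a <- 1; b <- 1                          # placeholder hyperparameters
pnbinom(k - 1, size = a, prob = b / (b + n), lower.tail = FALSE)
```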

Ben
  • Thanks @Ben, I'll update the post. Unfortunately, $n$ is also quite a large number! – Anonymous Scientist Sep 11 '20 at 10:24
  • I suppose, extending this idea, we could assume that the data come from a Poisson distribution since $n$ is also large, and then use a Gamma prior? – Anonymous Scientist Sep 11 '20 at 10:37
  • Yes, that is correct. So you could use the Poisson approximation if you prefer. If you use a gamma prior then you would get a Poisson-gamma predictive distribution. I have updated the answer to reflect the new information you have added. – Ben Sep 11 '20 at 10:45