I, like others, sometimes get confused by the hypergeometric distribution (HD) as it pertains to overlapping lists. This is because the HD is usually described with the "balls in an urn" metaphor rather than in terms of overlapping lists.

What is the proper way to calculate the p-value, according to the hypergeometric distribution, for the overlap of $B$ and $C$ in the lists below, ideally using the phyper function in R, where

  • $A$ contains all of the genes in the genome
  • $B$ is one subset of genes in the genome
  • $C$ is another subset of genes in the genome?
Ron Gejman

1 Answer

Trying to translate this into a statistical question, it seems you have a population with $a$ members and you take two random samples without replacement sized $b$ and $c$, and you want the distribution of $X$, the number appearing in both samples.
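
For reference, under this model $X$ follows a hypergeometric distribution. Written in the notation above (this is the standard textbook formula, added here rather than taken from the question), the probability of exactly $k$ shared members is

$$P(X = k) = \frac{\binom{b}{k}\binom{a-b}{c-k}}{\binom{a}{c}}, \qquad \max(0,\, b + c - a) \le k \le \min(b, c).$$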

As an illustration, suppose $a=5$, $b=2$ and $c=3$. There are $\binom{5}{2}\binom{5}{3} = 100$ ways of taking the two samples, of which 10 have none in common, 60 have one in common and 30 have two in common. In the language of black and white balls in an urn, the urn has $b=2$ white balls and $a-b=3$ black balls, and we take $c=3$ balls out to inspect how many white balls come out. In R we can effectively get these values with

> totalpop <- 5 
> sample1  <- 2
> sample2  <- 3 
> dhyper(0:2, sample1, totalpop-sample1, sample2) 
[1] 0.1 0.6 0.3
> phyper(-1:2, sample1, totalpop-sample1, sample2) 
[1] 0.0 0.1 0.7 1.0

which confirms the earlier calculations.
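
The counts of 10, 60 and 30 can also be checked by brute force. The following is a minimal sketch that enumerates every pair of samples with `combn` and tallies the overlaps; the helper names `first`, `second`, `overlaps` and `counts` are illustrative only, and this approach is only practical for very small populations.

```r
totalpop <- 5
sample1  <- 2
sample2  <- 3

# Every possible first sample (size 2) and second sample (size 3)
first  <- combn(totalpop, sample1, simplify = FALSE)
second <- combn(totalpop, sample2, simplify = FALSE)

# Size of the overlap for each of the 10 * 10 = 100 pairs of samples
overlaps <- unlist(lapply(first, function(s1) {
  sapply(second, function(s2) length(intersect(s1, s2)))
}))

# Tally how many pairs share 0, 1 or 2 members
counts <- table(factor(overlaps, levels = 0:2))
counts
counts / length(overlaps)   # proportions to compare with dhyper(0:2, ...)
```

The proportions should agree with the `dhyper` output above.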

If you want to test an observed overlap, then the probability of getting that number or smaller under this model is

phyper(overlap, sampleb, totala - sampleb, samplec) 

and of getting that number or larger is

1 - phyper(overlap - 1, sampleb, totala - sampleb, samplec)
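
To tie this back to gene lists, here is a small sketch under stated assumptions: the gene names and the wrapper `overlap_pvalue` are made up for illustration, and the upper tail is computed with `lower.tail = FALSE`, which is equivalent to the `1 - phyper(...)` form but tends to behave better numerically when the p-value is very small.

```r
# Hypothetical wrapper: P(X >= overlap) for two lists drawn from the same genome
overlap_pvalue <- function(overlap, sampleb, samplec, totala) {
  # phyper(q, ..., lower.tail = FALSE) returns P(X > q), so pass overlap - 1
  phyper(overlap - 1, sampleb, totala - sampleb, samplec, lower.tail = FALSE)
}

# Toy data (invented): A is the genome, B and C are subsets of it
A <- paste0("gene", 1:5)
B <- c("gene1", "gene2")
C <- c("gene1", "gene2", "gene4")

overlap <- length(intersect(B, C))   # number of genes in both lists
p_upper <- overlap_pvalue(overlap, length(B), length(C), length(A))
p_upper
```

With these toy lists the sizes match the earlier illustration ($a=5$, $b=2$, $c=3$) and the overlap is 2, so the result should equal the last entry of the `dhyper` output, 0.3.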
Henry