Framing the negative binomial distribution for DNA sequencing

Question

The negative binomial distribution has become a popular model for count data (specifically the expected number of sequencing reads within a given region of the genome from a given experiment) in bioinformatics. Explanations vary:

- Some explain it as something that works like the Poisson distribution but has an additional parameter, allowing more freedom to model the true distribution, with a variance not necessarily equal to the mean.
- Some explain it as a weighted mixture of Poisson distributions (with a gamma mixing distribution on the Poisson parameter).

Is there a way to square these rationales with the traditional definition of a negative binomial distribution as modeling the number of successes of Bernoulli trials before seeing a certain number of failures? Or should I just think of it as a happy coincidence that a weighted mixture of Poisson distributions with a gamma mixing distribution has the same probability mass function as the negative binomial?

It is also a compound Poisson distribution where you sum a Poisson-distributed number of logarithmic random variables. — Douglas Zare, Sep 22 '12 at 00:09

score 10 · Answer 1 · answered Sep 21 '12 at 20:52

IMHO, the negative binomial distribution is used for convenience.

So in RNA-Seq there is a common assumption that if you took an infinite number of measurements of the same gene across an infinite number of replicates, the true distribution would be lognormal. This distribution is then sampled via a Poisson process (yielding a count), so the true distribution of reads per gene across replicates would be a Poisson-lognormal distribution.

But in packages that we use, such as edgeR and DESeq, this distribution is modeled as a negative binomial distribution. This is not because the people who wrote them didn't know about the Poisson-lognormal distribution.

It is because the Poisson-lognormal distribution is a terrible thing to work with: it has no closed form, so fitting it requires numerical integration, and when you actually try to use it the performance is sometimes really bad.

A negative binomial distribution has a closed form, so it is a lot easier to work with, and the gamma distribution (the underlying mixing distribution) looks a lot like a lognormal distribution in that it sometimes looks roughly normal and sometimes has a long tail.
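To make the contrast concrete, here is a minimal sketch using only the Python standard library. The function names (`nb_pmf`, `poisson_lognormal_pmf`, `matched_nb_params`) and the parameter values are made up for illustration and are not taken from edgeR or DESeq. The negative binomial pmf is a one-line formula, while the Poisson-lognormal pmf is approximated here with a plain trapezoid rule over the log-rate; a real implementation would likely use a better quadrature.

```python
import math


def nb_pmf(k, r, p):
    """Closed-form NB pmf: k successes (probability p) before the r-th failure.

    r > 0 may be non-integer, which is what the gamma-Poisson view allows.
    """
    log_coef = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coef + k * math.log(p) + r * math.log1p(-p))


def poisson_lognormal_pmf(k, mu, sigma, n_grid=1000, width=8.0):
    """P(K = k) with K | lam ~ Poisson(lam) and log(lam) ~ Normal(mu, sigma^2).

    There is no closed form, so integrate over z = log(lam) with the
    trapezoid rule on [mu - width*sigma, mu + width*sigma].
    """
    lo = mu - width * sigma
    hi = mu + width * sigma
    h = (hi - lo) / n_grid
    total = 0.0
    for i in range(n_grid + 1):
        z = lo + i * h
        log_pois = k * z - math.exp(z) - math.lgamma(k + 1)
        log_norm = -0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
        weight = 0.5 if i in (0, n_grid) else 1.0
        total += weight * math.exp(log_pois + log_norm)
    return total * h


def matched_nb_params(mu, sigma):
    """NB (r, p) whose mean and variance match the Poisson-lognormal ones."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    r = 1.0 / math.expm1(sigma ** 2)   # NB variance is mean + mean^2 / r
    p = mean / (mean + r)              # NB mean is r * p / (1 - p)
    return r, p


# Illustrative values only: compare the two pmfs at a few counts.
mu, sigma = math.log(10.0), 0.5
r, p = matched_nb_params(mu, sigma)
for k in (0, 5, 10, 20, 40):
    print(k, nb_pmf(k, r, p), poisson_lognormal_pmf(k, mu, sigma))
```

Every evaluation of the Poisson-lognormal pmf costs a full pass over the integration grid, which is the practical burden described above when the pmf sits inside a fitting loop.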

But in this example (if you believe the assumption) the negative binomial can't be theoretically correct, because the theoretically correct distribution is the Poisson-lognormal; the two are reasonable approximations of one another but are not equivalent.

But I still think the "incorrect" negative binomial distribution is often the better choice, because empirically it gives better results: the numerical integration is slow and the fits can behave badly, especially for distributions with long tails.

score 7 · Answer 2 · answered Sep 22 '12 at 01:46

I looked through a few web pages and couldn't find an explanation, but I came up with one for integer values of $r$. Suppose we have two radioactive sources independently generating alpha and beta particles at the rates $\alpha$ and $\beta$, respectively.

What is the distribution of the number of alpha particles before the $r$th beta particle?

- Consider the alpha particles as successes, and the beta particles as failures. When a particle is detected, the probability that it is an alpha particle is $\frac{\alpha}{\alpha+\beta}$. So, this is the negative binomial distribution $\text{NB}(r,\frac{\alpha}{\alpha+\beta})$.
- Consider the time $t_r$ of the $r$th beta particle. This follows a gamma distribution $\Gamma(r,1/\beta)$. If you condition on $t_r = \lambda/\alpha$, then the number of alpha particles before time $t_r$ follows a Poisson distribution $\text{Pois}(\lambda)$. So, the distribution of the number of alpha particles before the $r$th beta particle is a Gamma-mixed Poisson distribution.

That explains why these distributions are equal.
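For readers who want the computation behind this, here is a sketch of the integral. It uses the same parametrisation as above (shape $r$ and scale $1/\beta$ for $t_r$); the symbol $K$ for the number of alpha particles before the $r$th beta particle is introduced here only for illustration.

$$
\begin{aligned}
P(K=k) &= \int_0^\infty \frac{(\alpha t)^k e^{-\alpha t}}{k!}\,\frac{\beta^r t^{r-1} e^{-\beta t}}{\Gamma(r)}\,dt \\
&= \frac{\alpha^k \beta^r}{k!\,\Gamma(r)} \int_0^\infty t^{k+r-1} e^{-(\alpha+\beta)t}\,dt \\
&= \frac{\Gamma(k+r)}{k!\,\Gamma(r)} \left(\frac{\alpha}{\alpha+\beta}\right)^k \left(\frac{\beta}{\alpha+\beta}\right)^r .
\end{aligned}
$$

For integer $r$ the prefactor equals $\binom{k+r-1}{k}$, so this is the probability mass function of $\text{NB}(r,\frac{\alpha}{\alpha+\beta})$. The last line is also well defined for non-integer $r > 0$, which is the form the gamma-Poisson mixture takes in general.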

score 2 · Answer 3 · answered Sep 21 '12 at 20:05

I can only offer intuition, but the gamma distribution itself describes (continuous) waiting times (how long it takes for a rare event to occur). So the fact that a gamma-distributed mixture of discrete Poisson distributions would result in a discrete waiting time (trials until N failures) does not seem too surprising. I hope someone has a more formal answer.

Edit: I always justified the negative binomial distribution for sequencing as follows: the actual sequencing step is simply sampling reads from a large library of molecules (Poisson). However, that library is made from the original sample by PCR, which means the original molecules are amplified exponentially. And the gamma distribution describes the sum of k independent exponentially distributed random variables, i.e. how many molecules are in the library after amplifying k sample molecules for the same number of PCR cycles.

Hence the negative binomial models PCR followed by sequencing.
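A small simulation can illustrate the statistical part of this argument. It is only a sketch: it checks that a sum of k exponentials followed by Poisson sampling gives overdispersed counts, and it is not a realistic model of PCR. The names `poisson_sample` and `simulate_counts` and the values k = 3 and mean abundance 10 are assumptions chosen for the example.

```python
import random
import statistics


def poisson_sample(lam, rng):
    """Draw a Poisson(lam) count as the number of unit-rate arrivals before time lam."""
    count = 0
    t = rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_counts(k, mean_abundance, n_draws, seed=0):
    """Gamma-Poisson counts built from the two steps described above.

    Step 1: library abundance = sum of k independent exponentials, each with
            mean mean_abundance / k, i.e. a Gamma(k, mean_abundance / k) draw.
    Step 2: the read count is Poisson with that abundance as its rate.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_draws):
        abundance = sum(rng.expovariate(k / mean_abundance) for _ in range(k))
        counts.append(poisson_sample(abundance, rng))
    return counts


counts = simulate_counts(k=3, mean_abundance=10.0, n_draws=20000)
sample_mean = statistics.fmean(counts)
sample_var = statistics.variance(counts)
# Theory for a gamma-Poisson mixture: variance = mean + mean^2 / k, so the
# sample variance should clearly exceed the sample mean.
print(sample_mean, sample_var)
```

A pure Poisson model would give a variance equal to the mean; the extra term mean²/k comes entirely from the gamma-distributed abundance.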

edited Sep 21 '12 at 20:33

answered Sep 21 '12 at 20:05

Felix Schlesinger


That makes sense, but in the context of measuring the number of sequencing reads in a genome is there an intuitive explanation for what the waiting period in the negative binomial distribution represents? In this case there is no waiting period - he's just measuring counts of sequencing reads. – RobertF Sep 21 '12 at 20:31
See my edit. I don't see how thinking of it in terms of waiting times fits the sequencing setting. The gamma poisson mixture is easier to interpret. But in the end they are the same thing. – Felix Schlesinger Sep 21 '12 at 20:34
OK, then perhaps the real question is: by what coincidence does modeling k successes + r failures in Bernoulli trials follow a gamma-Poisson mixture? Maybe a negative binomial modeling k successes + r failures can be thought of as an overdispersed Poisson distribution, because the many possible permutations of success and failure trials resulting in exactly k observed successes and r observed failures can be described as a collection of separate distributions? – RobertF Sep 21 '12 at 21:02

score 2 · Answer 4 · answered Sep 22 '12 at 16:43

I'll try to give a simplistic mechanistic interpretation that I found useful when thinking about this.

Assume we have perfectly uniform coverage of the genome before library prep, and we observe $\mu$ reads covering a site on average. Say that sequencing is a process that picks an original DNA fragment, puts it through a stochastic process (PCR, subsampling, etc.), and yields a base from the fragment with probability $p$, and a failure otherwise. If sequencing proceeds until $\mu\frac{1-p}{p}$ failures, it can be modeled with a negative binomial distribution, $\text{NB}(\mu\frac{1-p}{p}, p)$.

Calculating the moments of this distribution, we get an expected number of successes $\mu\frac{1-p}{p}\frac{p}{1-p} = \mu$, as required. For the variance of the number of successes, we get $\sigma^2 = \mu(1-p)^{-1}$. The overdispersion factor $(1-p)^{-1}$ is therefore set by the probability $1-p$ that library prep fails for a fragment: the variance approaches the Poisson value $\mu$ as $p \to 0$ and grows without bound as $p \to 1$.
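These moments can be checked numerically. The sketch below sums the negative binomial pmf directly, using the convention from this answer (successes with probability $p$, counted until $r = \mu\frac{1-p}{p}$ failures). The helper names and the test values of $\mu$ and $p$ are assumptions for illustration.

```python
import math


def nb_pmf(k, r, p):
    """P(k successes before the r-th failure), success probability p; r > 0 may be non-integer."""
    log_coef = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coef + k * math.log(p) + r * math.log1p(-p))


def nb_moments(mu, p, k_max=2000):
    """Mean and variance of NB(mu*(1-p)/p, p), by summing the pmf up to k_max."""
    r = mu * (1 - p) / p
    probs = [nb_pmf(k, r, p) for k in range(k_max + 1)]
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return mean, var


# Illustrative values: the results should be close to mu and mu / (1 - p).
for p in (0.1, 0.5, 0.8):
    mean, var = nb_moments(mu=30.0, p=p)
    print(p, mean, var, 30.0 / (1 - p))
```

The truncation at `k_max` is safe for these values because the pmf decays geometrically; a value of $p$ very close to 1 would need a larger cutoff.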

While the above is a slightly artificial description of the sequencing process, and one could make a proper generative model of the PCR steps etc, I think it gives some insight into the origin of the overdispersion parameter $(1-p)^{-1}$ directly from the negative binomial distribution. I do prefer the Poisson model with rate integrated out as an explanation in general.
