
The negative binomial (NB) distribution is defined on the non-negative integers and has probability mass function $$f(k;r,p)={\binom {k+r-1}{k}}p^{k}(1-p)^{r}.$$ Does it make sense to consider a continuous distribution on the non-negative reals defined by the same formula (replacing $k\in \mathbb N_0$ with $x\in\mathbb R_{\ge 0}$)? The binomial coefficient can be rewritten, up to a constant factor, as the product $(k+1)\cdot\ldots\cdot(k+r-1)$, which is well-defined for any real $k$. So we would have a PDF $$f(x;r,p)\propto\prod_{i=1}^{r-1}(x+i)\cdot p^{x}(1-p)^{r}.$$ More generally, we can replace the binomial coefficient with Gamma functions, allowing for non-integer values of $r$: $$f(x;r,p)\propto\frac{\Gamma(x+r)}{\Gamma(x+1)\Gamma(r)}\cdot p^{x}(1-p)^{r}.$$
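For concreteness, here is a quick numerical sanity check (a sketch in Python/SciPy; the helper name and the particular values of $r$ and $p$ are arbitrary) suggesting that the proposed unnormalized density is integrable, so that a normalizing constant exists:

```python
import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

def unnormalized_pdf(x, r, p):
    # Gamma(x+r) / (Gamma(x+1) Gamma(r)) * p^x * (1-p)^r, evaluated via logs for stability
    return np.exp(gammaln(x + r) - gammaln(x + 1) - gammaln(r)
                  + x * np.log(p) + r * np.log1p(-p))

for r, p in [(2, 0.5), (2.5, 0.3), (5, 0.8)]:
    const, err = quad(unnormalized_pdf, 0, np.inf, args=(r, p))
    print(f"r={r}, p={p}: integral of unnormalized pdf = {const:.6f} (error estimate {err:.1e})")
```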

Is it a valid distribution? Does it have a name? Does it have any uses? Is it maybe some compound or a mixture? Are there closed formulas for the mean and the variance (and the proportionality constant in the PDF)?

(I am currently studying a paper that uses an NB mixture model (with fixed $r=2$) and fits it via EM. However, the data are counts that have undergone some normalization, i.e. they are not integers. Nevertheless, the authors apply the standard NB formula to compute the likelihood and get very reasonable results, so everything seems to work out just fine. I found it very puzzling. Note that this question is not about NB GLM.)

amoeba
  • Wouldn't that be a mixture of Gammas with scale parameter $-\log p$? If you expand the polynomial $\prod_{i=1}^{r-1}(x+i)$ you'll just get $\sum_{i=2}^r a_i x^{i-1}$, where $a_i$ is the coefficient of $x^{i-1}$ in the polynomial; then multiplying by $p^x$ is the same as multiplying by $\exp\{x\log p\}$, with $\log p < 0$ of course, so it looks like it would convert to a weighted average of Gamma distributions, i.e., a mixture. – jbowman Oct 30 '17 at 03:16
  • ... should be $i=1$ in the sum above, actually. – jbowman Oct 30 '17 at 03:25
  • Can very well be @jbowman. Would be great if you could write it up as an answer in a bit more detail. – amoeba Oct 30 '17 at 08:45
  • Since $(1-p)^r$ depends only on the parameters, it is a constant that can be absorbed into the proportionality. Moreover, $\binom{x+r-1}{x}=\Gamma(x+r)/(\Gamma(r)\Gamma(x+1))$ also has a constant $1/\Gamma(r)$ that can be ignored. Writing $p^x=e^{-\rho x}$ for $\rho=-\log(p) \ge 0$, you are asking about a density proportional to $$f(x;r,\rho)=\frac{\Gamma(x+r)}{\Gamma(x+1)}\,e^{-\rho x}.$$ That identifies $\rho$ as a scale factor and $r$ as a shape parameter. For *integral* $r$ it's clearly a mixture of Gamma distributions. It makes no sense to restrict $r$ to integers, though. – whuber Oct 30 '17 at 14:38
  • @whuber Thanks. It would be great if you found time to post this as an answer. I realize that this question is rather straightforward; two reasons that I nevertheless decided to post it here (instead of working out the math myself) is that (1) I googled but could not find anywhere a discussion of this topic ("Continuous generalization of NB"), despite it being seemingly natural; and (2) I am curious whether this mixture of Gammas has some known usage or some nice properties, given that NB is a pretty standard distribution. – amoeba Oct 30 '17 at 20:34
  • @whuber Two properties that are relevant for the application I am reading/thinking about are (i) [approximately] quadratic mean-variance relationship, and (ii) finite non-zero likelihood at $x=0$. Negative binomial has these two properties. – amoeba Oct 30 '17 at 20:38
  • Gamma distributions have those properties, too. In order to compare a distribution on the integers to a continuous distribution you should think in terms of finite differences of the distribution function: that is, rather than $f(x;\theta)$ you should be considering $\int_{x}^{x+1} f(t;\theta)dt$. In this sense all Gamma distributions have nonzero likelihood at $x=0$. – whuber Oct 30 '17 at 20:43
  • @whuber Well, if I have to do maximum likelihood estimation and my sample of non-negative numbers includes some values of $x=0$, then I can't possibly fit any Gamma apart from the ones with shape parameter $k=1$. I have to care about the *exact* likelihood value at $x=0$... – amoeba Oct 30 '17 at 20:50
  • Understood. I'm just saying that this just doesn't look like a suitable criterion for comparing NB and Gamma distributions. If you're getting *any* exact zeros in your data (well, more than once in your lifetime) then *no* continuous distribution is an appropriate model, anyway! – whuber Oct 30 '17 at 20:55
  • @whuber Right. I am actually using a distribution that is continuous on positive values and has a point mass at zero. I believe this is the correct approach. But it has been suggested to me that I use a continuous generalization of NB that would have non-zero likelihood at zero and hence seemingly allow dealing with exact zeros. Hence my question. – amoeba Oct 30 '17 at 21:05
  • I think there may be some confusion in that suggestion: it appears to conflate a *probability* (which is what a point mass has, or an NB distribution has at zero) with a probability *density* (which is what the value of $f(0;\theta)$ would be). A nonzero density doesn't allow you to deal with exact zeros, because it still predicts zero chance that any value of $0$ will arise! – whuber Oct 30 '17 at 22:51

2 Answers


That's an interesting question. My research group has been using the distribution you refer to for some years in our publicly available bioinformatics software. As far as I know, the distribution does not have a name and there is no literature on it. While the paper by Chandra and Roy (2012) cited by Aksakal is closely related, the distribution they consider seems to be restricted to integer values of $r$ and they don't seem to give an explicit expression for the pdf.

To give you some background, the NB distribution is very heavily used in genomic research to model gene expression data arising from RNA-seq and related technologies. The count data arise as the number of DNA or RNA sequence reads extracted from a biological sample that can be mapped to each gene. Typically there are tens of millions of reads from each biological sample that are mapped to about 25,000 genes. Alternatively one might have DNA samples from which reads are mapped to genomic windows. We and others have popularized an approach whereby NB GLMs are fitted to the sequence reads for each gene, and empirical Bayes methods are used to moderate the genewise dispersion estimators (dispersion $\phi=1/r$). This approach has been cited in tens of thousands of journal articles in the genomic literature, so you can get an idea of how much it gets used.

My group maintains the edgeR R software package. Some years ago we revised the whole package so that it works with fractional counts, using a continuous version of the NB pmf. We simply converted all the binomial coefficients in the NB pmf to ratios of gamma functions and used it as a (mixed) continuous pdf. The motivation for this was that sequence read counts can sometimes be fractional because of (1) ambiguous mapping of reads to the transcriptome or genome and/or (2) normalization of counts to correct for technical effects. So the counts are sometimes expected counts or estimated counts rather than observed counts. And of course the read counts can be exactly zero with positive probability. Our approach ensures that the inference results from our software are continuous in the counts, matching the discrete NB results exactly when the estimated counts happen to be integers.
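As a rough illustration of this conversion (a Python/SciPy sketch, not edgeR code; the parameter values are arbitrary), the log of the gamma-function form can be computed with `gammaln`, and at integer counts it coincides with the ordinary discrete NB log-pmf:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import nbinom

def continuous_nb_logpdf(x, r, p):
    # Unnormalized continuous-NB log-density in the question's parameterization:
    # log[ Gamma(x+r) / (Gamma(x+1) Gamma(r)) * p^x * (1-p)^r ]
    return (gammaln(x + r) - gammaln(x + 1) - gammaln(r)
            + x * np.log(p) + r * np.log1p(-p))

r, p = 2.0, 0.4       # arbitrary illustrative values; r need not be an integer
k = np.arange(0, 10)  # integer "counts"
# scipy's nbinom is parameterized by the success probability, which is 1 - p here.
print(np.allclose(continuous_nb_logpdf(k, r, p), nbinom.logpmf(k, r, 1 - p)))  # expect True
```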

As far as I know, there is no closed form for the normalizing constant in the pdf (at least for non-integer $r$), nor are there closed forms for the mean or variance. When one considers that there is no closed form for the integral $$\int_0^\infty \frac{dx}{\Gamma(x)}$$ (the Fransén-Robinson constant), it is clear that there cannot be one for the integral of the continuous NB pdf either. However, it seems to me that the traditional mean and variance formulas for the NB should continue to be good approximations for the continuous NB. Moreover, the normalizing constant should vary slowly with the parameters and can therefore be ignored as having negligible influence on the maximum likelihood calculations.
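To sketch this numerically (again Python/SciPy rather than edgeR; the values of $r$ and $p$ are arbitrary), one can integrate the unnormalized pdf to obtain the normalizing constant, mean and variance, and compare the moments with the discrete NB formulas $rp/(1-p)$ and $rp/(1-p)^2$ for this parameterization:

```python
import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

def unnormalized_pdf(x, r, p):
    # Gamma(x+r) / (Gamma(x+1) Gamma(r)) * p^x * (1-p)^r, evaluated via logs for stability
    return np.exp(gammaln(x + r) - gammaln(x + 1) - gammaln(r)
                  + x * np.log(p) + r * np.log1p(-p))

r, p = 2.0, 0.6                                        # arbitrary illustrative values
c, _ = quad(unnormalized_pdf, 0, np.inf, args=(r, p))  # normalizing constant
m, _ = quad(lambda x: x * unnormalized_pdf(x, r, p) / c, 0, np.inf)
v, _ = quad(lambda x: (x - m) ** 2 * unnormalized_pdf(x, r, p) / c, 0, np.inf)

print("normalizing constant   :", c)
print("continuous NB mean, var:", m, v)
print("discrete NB mean, var  :", r * p / (1 - p), r * p / (1 - p) ** 2)
```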

One can confirm these hypotheses by numerical integration. The NB distribution arises in bioinformatics as a gamma mixture of Poisson distributions (see the Wikipedia negative binomial article or McCarthy et al below). The continuous NB distribution arises simply by replacing the Poisson distribution with its continuous analog with pdf $$f(x;\lambda)=a(\lambda)\frac{e^{-\lambda}\lambda^x}{\Gamma(x+1)}$$ for $x\ge 0$, where $a(\lambda)$ is a normalizing constant to ensure the density integrates to 1. Suppose for example that $\lambda=10$. The Poisson distribution has pmf equal to the above pdf (without the factor $a(\lambda)$) on the non-negative integers and, with $\lambda=10$, the Poisson mean and variance are equal to 10. Numerical integration shows that $a(10)=1/0.999875$ and that the mean and variance of the continuous distribution are equal to 10 to about 4 significant figures. So the normalizing constant is virtually 1 and the mean and variance are almost exactly the same as for the discrete Poisson distribution. The approximation is improved even further if we add a continuity correction, integrating from $-1/2$ to $\infty$ instead of from $0$. With the continuity correction, everything is correct (the normalizing constant is 1 and the moments agree with the discrete Poisson) to about 6 figures.
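The following sketch (Python/SciPy; not the original calculation, but it should reproduce the figures quoted above) carries out this numerical integration for $\lambda=10$, with and without the continuity correction:

```python
import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

lam = 10.0

def cpois_unnorm(x):
    # Continuous analogue of the Poisson pmf: exp(-lam) * lam**x / Gamma(x+1)
    return np.exp(-lam + x * np.log(lam) - gammaln(x + 1))

for lower in (0.0, -0.5):                     # without / with the continuity correction
    z, _ = quad(cpois_unnorm, lower, np.inf)  # this is 1/a(lam)
    m, _ = quad(lambda x: x * cpois_unnorm(x) / z, lower, np.inf)
    v, _ = quad(lambda x: (x - m) ** 2 * cpois_unnorm(x) / z, lower, np.inf)
    print(f"lower limit {lower:+.1f}: 1/a(10) = {z:.6f}, mean = {m:.5f}, var = {v:.5f}")
```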

In our edgeR package, we do not need to make any adjustment for the fact that there is mass at zero, because we always work with conditional log-likelihoods or with log-likelihood differences, and any delta functions cancel out of the calculations. This is typical, by the way, for GLMs with mixed probability distributions. Alternatively, we could consider the distribution to have no mass at zero but to have support starting at $-1/2$ instead of at zero. Either theoretical perspective leads to the same calculations in practice.

Although we make active use of the continuous NB distribution, we haven't published anything on it explicitly. The articles cited below explain the NB approach to genomic data but do not discuss the continuous NB distribution itself.

In summary, I am not surprised that the article you are studying obtained reasonable results from a continuous version of the NB pmf, because that is our experience also. The key requirement is that the means and variances are modelled correctly, and that will be the case provided the data, whether integer or not, exhibit the same form of quadratic mean-variance relationship that the NB distribution does.

References

Robinson, MD, and Smyth, GK (2008). Small sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321-332.

Robinson, MD, and Smyth, GK (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881-2887.

McCarthy, DJ, Chen, Y, and Smyth, GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40, 4288-4297.

Chen, Y, Lun, ATL, and Smyth, GK (2014). Differential expression analysis of complex RNA-seq experiments using edgeR. In: Statistical Analysis of Next Generation Sequence Data, Somnath Datta and Daniel S Nettleton (eds), Springer, New York, pages 51-74.

Lun, ATL, Chen, Y, and Smyth, GK (2016). It's DE-licious: a recipe for differential expression analyses of RNA-seq experiments using quasi-likelihood methods in edgeR. Methods in Molecular Biology 1418, 391-416.

Chen, Y, Lun, ATL, and Smyth, GK (2016). From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research 5, 1438.

Gordon Smyth
  • This is extremely helpful, @Gordon; thanks a lot for taking the time to write it up. I am working with RNA-seq data as well, so an answer from this perspective is especially valuable (I have now added [bioinformatics] tag to the question). Your work is about differential expression, whereas my current work is about clustering (the paper I was reading is Harris et al. on CA1 interneurons; [biorxiv](https://www.biorxiv.org/content/early/2017/06/28/143354)). Anyway, let me ask you a couple of small questions/clarifications. [cont.] – amoeba Nov 05 '17 at 09:22
  • (1) You said that continuous NB is a gamma mixture of continuous Poissons. Could you expand on it a little bit, perhaps show it a bit more explicitly? I think this will be useful for a general audience. Related to that, in the comments under my question two people wrote that continuous NB should be a mixture of Gammas with scale parameter $-\log(p)$, but only for integer $r$. Are both views true? (2) You said that delta function on zero does not matter for GLMs. At the same time, there is a large literature on GLMs with zero-inflated distributions. How does that fit together? – amoeba Nov 05 '17 at 09:28
  • (3) In your practical work, do you use ML to estimate all the parameters, including $r$, or do you fix $r$ to some specific value in advance (perhaps the same value shared for all genes?) and then hold it constant? I would guess that this should be much easier. (E.g. NB itself is an exponential dispersion family, but only with fixed $r$.) – amoeba Nov 05 '17 at 09:31
  • @amoeba Thanks for the biorxiv ref. (1) The derivation of NB as a mixture of Poissons is quite well known, and is in our papers, e.g. McCarthy et al. The derivation of the continuous NB follows just by substituting the continuous Poisson for the Poisson. Should I add this to my answer? It would make it long. I don't see how the continuous NB could be usefully represented as a mixture of gammas. (2) No, zero-inflation is a different additional complication. We avoid that complication in our work. – Gordon Smyth Nov 05 '17 at 10:00
  • @amoeba (3) We estimate all the parameters. It is crucial to estimate the genewise dispersions to achieve error rate control, and this must be done with special care because the sample sizes are often tiny and the dimension of the data is huge. We use a complex procedure that involves adjusted profile likelihood (think REML) within each gene linked with a weighted-likelihood empirical Bayes procedure between genes. The genewise NB GLMs are then fitted by ML with the dispersions fixed. Finally, coefficients are tested using quasi-likelihood F-tests. – Gordon Smyth Nov 05 '17 at 10:09
  • @amoeba I find it strange that the biorxiv paper makes no mention of the fact that different cells have different sequence depths. Read counts obviously depend on sequence depth as much as on gene expression level, but the paper treats read counts as equivalent to expression. The development in the paper doesn't make sense to me as it is currently stated. – Gordon Smyth Nov 05 '17 at 22:59
  • I think they normalize each cell's counts by the sum of counts, see the first page of Methods. That's what turns integer counts into non-integers; otherwise they would be integers because they are using UMIs. – amoeba Nov 05 '17 at 23:08
  • @amoeba OK I see. They just say that they normalize, they don't actually say how. Just scaling the counts up or down in a linear fashion would play havoc with the mean-variance relationship. (Feel free to move these comments to chat if you think appropriate.) – Gordon Smyth Nov 05 '17 at 23:55
  • I assume they just scale linearly. I would be very interested in discussing this further, but this is rather off-topic here. Let me read some of your papers that you linked to and then I might write you an email if I may. Meanwhile I awarded my bounty to this answer and started an additional one because I think this answer deserves more recognition. – amoeba Nov 06 '17 at 21:00
  • @amoeba Thanks, that's generous. Feel free to email me. – Gordon Smyth Nov 06 '17 at 22:27

Look at this paper: Chandra, Nimai Kumar, and Dilip Roy. A continuous version of the negative binomial distribution. Statistica 72, no. 1 (2012): 81.

It's defined in the paper via the survival function, which is a natural approach since the negative binomial is used in reliability analysis:

$$S_r(x)=\begin{cases}q^x & \text{for}\ r=1 \\ \sum_{k=0}^{r-1}\binom {x+k-1}{k}p^kq^x & \text{for}\ r=2,3,\dots \end{cases}$$ where $q=e^{-\lambda}$, $\lambda\ge 0$, $p+q=1$, and $r\in\mathbb N$, $r>0$.
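As a rough consistency check (a Python/SciPy sketch, not from the paper; the values of $r$ and $\lambda$ are arbitrary), at integer $x$ this survival function should agree with the survival function of the ordinary discrete NB counting failures before the $r$-th success with success probability $p$:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import nbinom

def S(x, r, lam):
    # Survival function above, with q = exp(-lam) and p = 1 - q.
    # The coefficient binom(x+k-1, k) is written as Gamma(x+k) / (Gamma(k+1) Gamma(x)),
    # which also makes sense for non-integer x > 0.
    q = np.exp(-lam)
    p = 1.0 - q
    k = np.arange(r)  # k = 0, ..., r-1
    log_terms = (gammaln(x + k) - gammaln(k + 1) - gammaln(x)
                 + k * np.log(p) + x * np.log(q))
    return np.exp(log_terms).sum()

r, lam = 3, 0.7  # arbitrary illustrative values
p = 1.0 - np.exp(-lam)
for x in (1, 2, 5, 10):
    # At integer x this should match P(X >= x) for the discrete NB counting failures
    # before the r-th success with success probability p (scipy's parameterization).
    print(x, S(x, r, lam), nbinom.sf(x - 1, r, p))
```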

Aksakal
  • Thanks! I will take a look at this paper. (It wasn't me who downvoted.) – amoeba Nov 01 '17 at 00:57
  • @amoeba, I don't worry about downvoting, it's internet :) – Aksakal Nov 01 '17 at 00:59
  • (It's bizarre that this reply was downvoted...) +1 – whuber Nov 01 '17 at 13:39
  • It's good to have this reference, but ideally I would like to see a more detailed discussion here. Is this survival function defining the same distribution as the PDF in my question? (By the way, I find it a bit weird that the authors use binomial coefficients for non-integer values of $x$.) Several comments above point out that this is a mixture of gamma distributions (I don't see any discussion of this in the paper); what are the parameters of these gammas, what are the mixture weights? Do the NB formulas for the mean and the variance hold for the continuous version? – amoeba Nov 01 '17 at 17:14
  • @amoeba, the paper has moments, they're not the same as in NB, unfortunately – Aksakal Nov 01 '17 at 18:38