Our code goes through multiple stages of review. I wish to use the number of defects at an earlier stage of review as a "defect density" estimate for later stages.

It sometimes happens that code has zero defects in the early stage of review. This is causing me trouble since if $\lambda = 0$ then $P(k)=\frac{e^{-\lambda t}(\lambda t)^k}{k!}=0$ for all $k$.

R does indeed just throw an error in this case:

```r
foo <- 0:10
bar <- 2 * foo
glm(bar ~ log(foo), family = poisson)
# fails because log(0) = -Inf
```

I could get around this in several ways:

  • Ignore places with zeros (this would drop 1,486 of my 4,476 data points)
  • Replace all the zeros with some number
  • Add one to everything

What's the best way to handle this?

– Xodarap

  • Producing inference for $\lambda$ when the count is zero is an [interesting problem](http://en.wikipedia.org/wiki/Maximum_likelihood#Asymptotic_normality) (see the section "Estimate on boundary"). I'd be inclined to take a Bayesian approach for this kind of application. – Glen_b Feb 15 '13 at 15:22
  • This question is based on an incorrect assertion. When $\lambda=0$, $P(0) = \exp(0)(0)^0/0! = 1$, not $0$. The problem is the use of a *log link*, because $\log(0)$ is undefined. – whuber Feb 15 '13 at 16:28
  • @whuber: good point! Is there a way around this in R? – Xodarap Feb 15 '13 at 17:58
  • `R` (via `glm`) allows various link functions; you can even define your own. – whuber Feb 15 '13 at 18:18
  • Yes, intensity zero should not crash Poisson functions, but it doesn't solve your problem, because it still produces zero probability of defects, which is not a reasonable prediction by any measure. – Aksakal May 22 '19 at 21:05
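
To illustrate whuber's points above: the Poisson pmf is perfectly well defined at $\lambda = 0$ (all of its mass sits at $k = 0$), and the `glm` error comes from taking `log()` of a zero-valued covariate, not from the log link itself. A minimal sketch, reusing the question's `foo` and `bar`:

```r
# P(k) at lambda = 0: all probability mass is at k = 0
dpois(0, lambda = 0)    # 1, not 0
dpois(1:3, lambda = 0)  # 0 0 0

# Refit with the raw covariate: the log *link* exponentiates the
# linear predictor, so log(0) never occurs
foo <- 0:10
bar <- 2 * foo
fit <- glm(bar ~ foo, family = poisson)  # runs without error
```

If modeling the mean on another scale is preferred, `glm` also accepts, e.g., `poisson(link = "identity")` or `poisson(link = "sqrt")`, though identity-link fits sometimes need starting values supplied via the `start` argument.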

2 Answers

Either use Bayesian methods (as mentioned in a comment by @Glen_b), or some kind of borrowing of strength, that is, analyzing multiple data sets with a common model that shares some parameters (which can be seen as a form of empirical Bayes). You say:

> It sometimes happens that code has zero defects in the early stage of review

A maximum likelihood estimate of $\lambda = 0$ leads to a prediction of zero future errors, though in some models one could still construct prediction intervals (of the form $[0, b]$) for the number of future errors. But borrowing strength by modeling multiple data sets together seems better to me. This is now a well-developed field; see https://en.wikipedia.org/wiki/List_of_software_reliability_models and a Google Scholar search on software reliability models.
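
To make the Bayesian route concrete (this sketch is an illustration, not part of the original answer): with a conjugate Gamma prior on $\lambda$, a zero count still yields a strictly positive posterior mean. The prior parameters below are hypothetical.

```r
# Gamma(a, b) prior on lambda; observing k defects over exposure t
# gives a Gamma(a + k, b + t) posterior -- positive even when k = 0
a <- 0.5; b <- 1   # hypothetical weakly informative prior
k <- 0;   t <- 1   # zero defects seen at the early review stage

(a + k) / (b + t)                                     # posterior mean: 0.25, not 0
qgamma(c(0.025, 0.975), shape = a + k, rate = b + t)  # 95% credible interval
```

The posterior mean $(a + k)/(b + t)$ shrinks the raw rate $k/t$ toward the prior, which is the same kind of borrowing of strength described above.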

– kjetil b halvorsen

A Bayesian approach is sensible. However, even if you're a frequentist, the intensity is still not zero just because you didn't observe any events.

Suppose you're asked: what is the probability of the Sun not rising tomorrow? You've been on Earth for 40 years, and it has always risen. So, from a frequentist point of view, the intensity is 1 in 40 years, as Laplace would suggest. Next year it will be 1 in 41 years, and so on. Hence, you should set the intensity to 1 in whatever the relevant exposure is. For instance, if the exposure is lines of code, then $\lambda = \frac{1}{\text{lines of code}}$.
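
A quick sketch of that rule (the line count is made up):

```r
# Hypothetical module with zero observed defects at the early review stage
loc        <- 2500      # lines of code (made-up figure)
lambda_hat <- 1 / loc   # "one event per observed exposure" fallback
lambda_hat * 1000       # implied rate: 0.4 defects per 1000 lines
```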

– Aksakal