GLM for count data with all zeroes in one category

Question

I have a GLM where the response variable is count data and the predictive variable is a factor with 4 levels. I used a negative binomial distribution to model the relationship between both variables (there was evidence of overdispersion, so Poisson distribution was not appropriate). However, for one of the levels of the factor, the observed values were all zeroes. I know this has been asked here but the question was never answered.

Here is a reproducible example, where the first level (A) is considered a control group and the last level (D) is the one with only zeroes:

y <- c(5, 12, 25, 6, 10, 8, 4, 3, 0, 0, 1, 0, 0, 0, 0, 0)
fac <- rep(LETTERS[1:4], each = 4)
library(MASS)
model <- glm.nb(y ~ fac)
summary(model)

Here is the summary:

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.4849     0.2749   9.041  < 2e-16 ***
facB          -0.6523     0.4126  -1.581 0.113899    
facC          -3.8712     1.0631  -3.641 0.000271 ***
facD         -21.7875  4713.3084  -0.005 0.996312    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(4.5692) family taken to be 1)

For level D, which was the level with all zeroes, the estimate is relatively lower compared to the control group (intercept). However, with such a large standard error, the p value is inevitably close to 1.

Intuitively, I would have expected that group D would be significantly different from the control group. I suspect that this has to do with the null variance of group D, but I'm not sure how it is related.

I must mention that observing all zeroes in group D was not unexpected (i.e. it is not an anomalous result). A purely descriptive approach would conclude that there is a huge difference between levels A and D, but I would like to be able to provide a formal statistical answer to that problem.

Any help would be greatly appreciated. Thanks!

The fact that you get a P-value close to 1 indicates a numerical failure; this is, in fact, one way of *checking* for complete separation. Of course, the P-value, point estimate, and standard error are all wrong. — guy, Dec 13 '18 at 22:28

Björn · Accepted Answer · 2018-12-14T19:27:19.570

The problem is that there is complete separation in the dataset so that you are essentially comparing $\log 0/t_i$ to $\log y_j/t_j$. I.e. the maximum likelihood estimate lies on the boundary of the parameter space (minus or plus infinity on the log scale).

If you assume that the rate ratio between the groups is not in truth infinite and you can quantify your prior beliefs, then a Bayesian approach may be attractive and is easy to implement (using e.g. the rstanarm R package). Using rstanarm, this is as easy as (just be aware of the inverse use for the dispersion parameter - see the rstanarm documentation):

library(rstanarm)    
stan_glm(data = yourdata, y ~ fac, offset = log_time_offset_if_any,
         family = neg_binomial_2)

Thanks for the info Björn! I'll certainly have a go at a bayesian approach. — Cédric_Frenette, Dec 14 '18 at 18:15

GLM for count data with all zeroes in one category

1 Answers1

Linked