I have a GLM where the response variable is count data and the predictive variable is a factor with 4 levels. I used a negative binomial distribution to model the relationship between both variables (there was evidence of overdispersion, so Poisson distribution was not appropriate). However, for one of the levels of the factor, the observed values were all zeroes. I know this has been asked here but the question was never answered.
Here is a reproducible example, where the first level (A) is considered a control group and the last level (D) is the one with only zeroes:
y <- c(5, 12, 25, 6, 10, 8, 4, 3, 0, 0, 1, 0, 0, 0, 0, 0)
fac <- rep(LETTERS[1:4], each = 4)
library(MASS)
model <- glm.nb(y ~ fac)
summary(model)
Here is the summary:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.4849 0.2749 9.041 < 2e-16 ***
facB -0.6523 0.4126 -1.581 0.113899
facC -3.8712 1.0631 -3.641 0.000271 ***
facD -21.7875 4713.3084 -0.005 0.996312
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(4.5692) family taken to be 1)
For level D, which was the level with all zeroes, the estimate is relatively lower compared to the control group (intercept). However, with such a large standard error, the p value is inevitably close to 1.
Intuitively, I would have expected that group D would be significantly different from the control group. I suspect that this has to do with the null variance of group D, but I'm not sure how it is related.
I must mention that observing all zeroes in group D was not unexpected (i.e. it is not an anomalous result). A purely descriptive approach would conclude that there is a huge difference between levels A and D, but I would like to be able to provide a formal statistical answer to that problem.
Any help would be greatly appreciated. Thanks!