binomial GLM output hugely affected by a factor level with all zeros

Question

I’m trying to estimate the effects of pest control on rat abundance, but I'm finding that the inclusion of a pest control category that contains all zeros is having huge effects on the results. My response variable is "propTun”, which is the proportion of tracking tunnels at a given site that recorded rat footprints (out of ~10 tunnels). My predictor variable is "PatchPC", which is the type of pest control being undertaken at each site, and is a factor with four levels: predator free (i.e. complete eradication, with all zeros for propTun), intensive control, intermediate control, and no control.

I initially modelled these data with a linear model, after logit-transforming propTun, and got the following results:

Call:
lm(formula = logit(propTun) ~ PatchPC, data = pest)

Residuals:
   Min     1Q Median     3Q    Max 
-4.576 -1.608  0.000  1.680  4.944 

Coefficients:
                            Estimate Std. Error t value        Pr(>|t|)    
(Intercept)                  -3.6636     0.5971  -6.135 0.0000000060309 ***
PatchPCIntensive control      2.3828     0.6876   3.465        0.000675 ***
PatchPCIntermediate control   4.5757     0.6564   6.971 0.0000000000708 ***
PatchPCNo control             3.9762     0.7079   5.617 0.0000000803328 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.313 on 166 degrees of freedom
Multiple R-squared:  0.2724,    Adjusted R-squared:  0.2593 
F-statistic: 20.72 on 3 and 166 DF,  p-value: 0.00000000001885

These results were pretty much what I expected – i.e. differences in parameter estimates among the predator control categories, and a reasonable amount of variance explained. However, this model is not valid because there is a large differences in variance between the "predator free" category (which has a propTun value of zero for all sites) and the other categories.

A better approach should be a binomial GLM. However, this gave very different results [FYI “nCounts” is number of tunnels per site]:

Call:
glm(formula = propTun ~ PatchPC, family = "binomial", data = pest, 
    weights = nCounts)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.5735  -2.2339  -0.0005   2.4553   4.8196  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                   -18.12     425.76  -0.043    0.966
PatchPCIntensive control       17.33     425.76   0.041    0.968
PatchPCIntermediate control    18.73     425.76   0.044    0.965
PatchPCNo control              18.29     425.76   0.043    0.966

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1420.9  on 169  degrees of freedom
Residual deviance: 1083.1  on 166  degrees of freedom

Note this model has huge standard errors around the parameter estimates! Also, back-transforming the parameter estimates gives values of either ~0 (“ predator free" category AKA Intercept) or ~1, whereas graphs of the data show that average values should be 0 for the "predator free" category then ~0.3, 0.6, and 0.6 for the other three. Which makes me think something has gone wrong...

I wondered if having a category with all zeros was somehow messing up the results, so reran the model after subsetting the data to exclude the "predator free" category. And I got output more like was expecting:

Coefficients:
                            Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                  -0.7859     0.1005  -7.817  0.00000000000000542 ***
PatchPCIntermediate control   1.3989     0.1273  10.990 < 0.0000000000000002 ***
PatchPCNo control             0.9622     0.1455   6.611  0.00000000003816547 ***

I'm confused here as to why the original binomial GLM (i.e. the one including the “predator free" sites) is giving such different results from the other two models. Assuming it's the one that is giving misleading results, why does it have such large standard errors, why is it estimating propTun to be either ~0 or ~1, and why did removing the sites with the zeros fix the problem?

Thanks for your help,

Jay

Probably Hauck-Donner or quasi-separation. See [this question](http://stats.stackexchange.com/q/45803/3601) and the others it links too. — Aaron left Stack Overflow, Dec 03 '13 at 02:38
It appears you have not yet considered the meaning on the probability scale of a logistic regression coefficient of -18 or +17. — DWin, Dec 03 '13 at 02:52
Some things you can do about separation are [here](http://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression/68917). — Scortchi - Reinstate Monica, Dec 03 '13 at 09:30
Thanks for those suggestions. Tried the robust methods from the 'hlr' package, and while back-transformed param estimates were not quite as close to 0 and 1 as the original data they were still very close. Tried the 'bayesglm' function from the 'arm' package and it brought the std errors in line with what I expected, but still gave param estimates very close to 0 or 1. Think the best approach is to stick with the std glm but with the predator free sites removed. I don't actually need statistics to tell me that predator free sites should have zero predators in them anyway! Cheers — jay, Dec 04 '13 at 03:36
So since the problem did appear to be quasi separation, @Aaron if you want to rewrite your comment as an answer I can tick it as answered. — jay, Dec 04 '13 at 03:38
Thanks, Jay. Glad to be of help. I don't think this questions add much beyond those other questions, though, so instead I'll flag it as a duplicate. — Aaron left Stack Overflow, Dec 04 '13 at 13:30

binomial GLM output hugely affected by a factor level with all zeros

0 Answers0