I’m trying to estimate the effects of pest control on rat abundance, but I'm finding that the inclusion of a pest control category that contains all zeros is having huge effects on the results. My response variable is "propTun”, which is the proportion of tracking tunnels at a given site that recorded rat footprints (out of ~10 tunnels). My predictor variable is "PatchPC", which is the type of pest control being undertaken at each site, and is a factor with four levels: predator free (i.e. complete eradication, with all zeros for propTun), intensive control, intermediate control, and no control.
I initially modelled these data with a linear model, after logit-transforming propTun, and got the following results:
Call:
lm(formula = logit(propTun) ~ PatchPC, data = pest)
Residuals:
Min 1Q Median 3Q Max
-4.576 -1.608 0.000 1.680 4.944
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.6636 0.5971 -6.135 0.0000000060309 ***
PatchPCIntensive control 2.3828 0.6876 3.465 0.000675 ***
PatchPCIntermediate control 4.5757 0.6564 6.971 0.0000000000708 ***
PatchPCNo control 3.9762 0.7079 5.617 0.0000000803328 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.313 on 166 degrees of freedom
Multiple R-squared: 0.2724, Adjusted R-squared: 0.2593
F-statistic: 20.72 on 3 and 166 DF, p-value: 0.00000000001885
These results were pretty much what I expected – i.e. differences in parameter estimates among the predator control categories, and a reasonable amount of variance explained. However, this model is not valid because there is a large differences in variance between the "predator free" category (which has a propTun value of zero for all sites) and the other categories.
A better approach should be a binomial GLM. However, this gave very different results [FYI “nCounts” is number of tunnels per site]:
Call:
glm(formula = propTun ~ PatchPC, family = "binomial", data = pest,
weights = nCounts)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.5735 -2.2339 -0.0005 2.4553 4.8196
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.12 425.76 -0.043 0.966
PatchPCIntensive control 17.33 425.76 0.041 0.968
PatchPCIntermediate control 18.73 425.76 0.044 0.965
PatchPCNo control 18.29 425.76 0.043 0.966
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1420.9 on 169 degrees of freedom
Residual deviance: 1083.1 on 166 degrees of freedom
Note this model has huge standard errors around the parameter estimates! Also, back-transforming the parameter estimates gives values of either ~0 (“ predator free" category AKA Intercept) or ~1, whereas graphs of the data show that average values should be 0 for the "predator free" category then ~0.3, 0.6, and 0.6 for the other three. Which makes me think something has gone wrong...
I wondered if having a category with all zeros was somehow messing up the results, so reran the model after subsetting the data to exclude the "predator free" category. And I got output more like was expecting:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.7859 0.1005 -7.817 0.00000000000000542 ***
PatchPCIntermediate control 1.3989 0.1273 10.990 < 0.0000000000000002 ***
PatchPCNo control 0.9622 0.1455 6.611 0.00000000003816547 ***
I'm confused here as to why the original binomial GLM (i.e. the one including the “predator free" sites) is giving such different results from the other two models. Assuming it's the one that is giving misleading results, why does it have such large standard errors, why is it estimating propTun to be either ~0 or ~1, and why did removing the sites with the zeros fix the problem?
Thanks for your help,
Jay