
My dataset is similar to this; it describes the presence/absence of a parasite in six species of animal from two locations. Each row in the dataset represents a different individual animal.

I would like to look at how parasite infection varies across species and between locations. However, infection is absent in one location for all species and absent in both locations for two species. A summary of the example data is below.

table(example$fparasite.pres, example$species, example$location)

, ,  = location.1

   
    species.1 species.2 species.3 species.4 species.5 species.6
  0       169        19        63         0         0        61
  1         0         0         0         0         0         0

, ,  = location.2

   
    species.1 species.2 species.3 species.4 species.5 species.6
  0        45        36        94       242        65       124
  1         1         0         0       318        11        32

When I try to run a glm on these data, I get a warning message suggesting I have complete separation in my data - not surprising.

glm(parasite.pres ~ location + species, family = binomial, data = example)

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

Question: In such a situation, is it appropriate to use multiple chi-squared and/or Fisher exact tests to statistically assess relationships separately? I would be looking to compare parasite presence separately for each species between the two locations, and then compare parasite presence between each species within each location.
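To make the proposed approach concrete, here is a sketch of one of the many separate comparisons this would involve, using the counts for species.1 taken from the summary table above (the matrix is built by hand here rather than from the `example` data frame):

```r
# Sketch: Fisher's exact test for species.1 across the two locations,
# with counts copied from the summary table above.
# Rows: locations; columns: parasite absent (0) / present (1).
tab <- matrix(c(169, 0,
                 45, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(location = c("location.1", "location.2"),
                              parasite = c("absent", "present")))
fisher.test(tab)  # exact test; valid even with zero cells
```

Repeating this for every species-by-location and species-by-species pair would, of course, mean many tests and a multiple-comparisons problem.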

I note that there is a great post here detailing different options to try when you encounter complete separation in a logistic model, however I'm not sure that these options are all relevant/appropriate here, and several of the suggested packages no longer exist in current versions of R.

kjetil b halvorsen
Pat Taggart
  • *Separation* is an issue for maximum likelihood estimation in that the model won't converge but not for OLS. Try running an OLS ANOVA which will converge but will provide no information for those parameters. –  Nov 17 '20 at 14:12
  • OLS will not be a good choice here, @user332577. – gung - Reinstate Monica Nov 17 '20 at 15:50
  • @gung-ReinstateMonica I agree. Where I disagree concerns the value one can place in linear, OLS models as baseline estimates against which more sophisticated models, such as your suggestions below, can be compared and evaluated, i.e., I'm always in favor of the simplest solutions as a starting point. –  Nov 19 '20 at 13:09

1 Answer


There are two problems that arise from separation. First, you cannot estimate the slope of the relationship between the explanatory variables and the response (see the thread you linked). If you have some outside information you can use to constrain the possible slopes, or if you have a prior belief, you can use a Bayesian approach. (Firth logistic regression is a form of Bayesian logistic regression.) Similarly, you could use a regularization approach, say with a ridge penalty, if you had some sense of what value to use for lambda. All of these are a little tricky. Fortunately, you don't seem too concerned about estimating these relationships.
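If you did want finite estimates despite the separation, Firth's penalized-likelihood logistic regression is available in the logistf package. A minimal sketch, assuming the individual-level `example` data frame from the question (with `parasite.pres` coded 0/1):

```r
# Sketch only: Firth logistic regression on the individual-level data.
# Assumes `example` has columns parasite.pres (0/1), location, species.
# install.packages("logistf")
library(logistf)
fm <- logistf(parasite.pres ~ location + species, data = example)
summary(fm)  # finite coefficients with profile-penalized-likelihood CIs
```

Unlike ordinary maximum likelihood, the Firth penalty keeps the estimates finite under separation, though you should still interpret the separated coefficients cautiously.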

The second problem is that the Wald tests won't work. This problem is simpler. There are three possible kinds of tests (Wald, likelihood ratio, and score tests, see here); use one of the others. The score test is probably the best for this case. The chi-squared test is a score test, but it doesn't cover cases where you have two explanatory variables, as you do (or continuous variables, which you don't have). So use your model, but get p-values for it based on one of the other tests. Here is an example, coded in R:

d
#  parasite.pres parasite.abs location   species
#            169            0        1 species.1
# ...
#             61            0        1 species.6
#             45            1        2 species.1
# ...
#            124           32        2 species.6
m = glm(cbind(parasite.pres, parasite.abs)~location+species, 
        d, family=binomial)
summary(m)
# ...
# Coefficients:
#                    Estimate Std. Error z value Pr(>|z|)    
# (Intercept)          56.412 141389.459   0.000   0.9997    
# location            -26.303  70694.729   0.000   0.9997    
# speciesspecies.2     24.021 111540.087   0.000   0.9998    
# speciesspecies.3     25.013 113322.774   0.000   0.9998    
# speciesspecies.4     -4.080      1.015  -4.021  5.8e-05 ***
# speciesspecies.5     -2.030      1.062  -1.911   0.0560 .  
# speciesspecies.6     -2.452      1.030  -2.380   0.0173 *  
# ...
# 
#     Null deviance: 5.2793e+02  on 9  degrees of freedom
# Residual deviance: 2.3543e-10  on 3  degrees of freedom
# AIC: 31.914
# 
# Number of Fisher Scoring iterations: 25
drop1(m, test="LRT")  # the likelihood ratio test
# ...
#          Df Deviance     AIC    LRT  Pr(>Chi)    
# <none>          0.00  31.914                     
# location  1    26.32  56.234  26.32 2.893e-07 ***
# species   5   282.98 304.895 282.98 < 2.2e-16 ***
drop1(m, test="Rao")  # the score test
# ...
#          Df Deviance     AIC Rao score  Pr(>Chi)    
# <none>          0.00  31.914                        
# location  1    26.32  56.234    16.843  4.06e-05 ***
# species   5   282.98 304.895   228.009 < 2.2e-16 ***
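The aggregated data frame `d` used above can be built from the individual-level data. A sketch, assuming `example` has a numeric 0/1 column `parasite.pres` plus `location` and `species` (adjust the column names to match your data):

```r
# Sketch: collapse individual animals into per-group counts for the
# cbind(successes, failures) form of a binomial glm.
# Assumes `example` has columns parasite.pres (0/1), location, species.
d <- aggregate(cbind(parasite.pres = parasite.pres,
                     parasite.abs  = 1 - parasite.pres) ~ location + species,
               data = example, FUN = sum)
```

Fitting the binomial glm to these counts gives the same coefficients and likelihood-based tests as fitting the 0/1 model to the individual rows.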
gung - Reinstate Monica
  • Thanks @gung. I interpret ```drop1(m, test="Rao")``` as indicating that adding location & species improves the model, demonstrated by the high Rao score values and small p-values. Is this correct? Why then would AIC increase, suggesting a less parsimonious model? I also note that if I run ```glm(parasite.pres ~ location*species, family = binomial, data = example)``` and then ```drop1(m, test="Rao")```, it suggests that ```location:species``` is not important (Rao score = 0; p-value = 1). I would expect it to be important given that the relationship between species is clearly different within location.1 vs location.2 – Pat Taggart Nov 17 '20 at 22:40
  • 1
    @PatTaggart, Yes, both are important. The AICs are for the models *without* those variables--that's why they're worse. You don't have enough information to tell if the relationship between species and presence differs by location. That's the point. It may look obvious to you, & it may make sense theoretically, but the data aren't sufficient to show that. – gung - Reinstate Monica Nov 17 '20 at 23:27
  • What do you mean that the $\chi^2$ test is a score test? Perhaps it’s from a couple of hours of likelihood ratio testing with $\chi^2$ distributions today, but I’m thinking of a likelihood ratio test when you say a $\chi^2$ test. – Dave Nov 18 '20 at 03:37
  • @Dave, see [here](https://en.wikipedia.org/wiki/Score_test#Special_cases). You can also see the p-values are identical in my linked answer. The [$G$-test](https://en.wikipedia.org/wiki/G-test) is the likelihood ratio version of the $\chi^2$-test. – gung - Reinstate Monica Nov 18 '20 at 14:49
  • @gung-ReinstateMonica So the G-test is what `lmtest::lrtest` on nested (multinomial) logistic regression models does, where one model is intercept-only, and the other has the intercept and a group variable? That's what I've been doing recently for testing proportions, and it's nice to see that it has a name and a Wikipedia article (I forget if `lrtest` handles multinomial logistic models, but it definitely handles binomial logistic models.) – Dave Nov 18 '20 at 15:41
  • I don't think so, @Dave. That's simply the ratio of 2 model likelihoods (or -2* the difference of their logs), which is distributed as chi-squared. The G-test is the likelihood ratio test of the independence of 2 categorical variables in a contingency table. Pearson's chi-squared test is a score test of the independence of 2 categorical variables in a contingency test. Both tests yield test statistics that are (asymptotically) distributed as chi-squared (which is also true of many other tests, such as the nested model test you mention). – gung - Reinstate Monica Nov 18 '20 at 17:42
  • Sorry @gung, one more follow-up question - are the estimates and SEs from ```summary(m)``` valid? Or would I simply report that location and species were significant in the model and give no coefficient estimates and SEs / provide no final model table? For example, "The most parsimonious GLM supported the inclusion of location and species as fixed effects (location: Rao score = 16.84, df = 1, p < 0.001; species: Rao score = 228.01, df = 5, p < 0.001)." – Pat Taggart Nov 24 '20 at 05:35
  • @PatTaggart, with separation, the true maximum likelihood estimate would be infinite. The numbers you get arise because the algorithm that estimates them has a failsafe that makes it terminate eventually. As I wrote above, "you cannot estimate the slope of the relationship between the explanatory variables and the response (see the thread you linked)". The Wald SEs are nonsense. If you wanted to try something, you could try to profile the likelihood to get a score CI. – gung - Reinstate Monica Nov 24 '20 at 12:34
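The profiling suggested in the last comment can be requested directly. A sketch using the fitted model `m` from the answer; for the coefficients affected by separation the profiled likelihood is flat in one direction, so expect warnings or non-finite bounds there:

```r
# Sketch: likelihood-profile confidence intervals for the fitted glm `m`.
# MASS (shipped with R) provides the profile-based confint method for glm.
library(MASS)
confint(m)  # profiles the deviance; separated terms give unbounded limits
```

The intervals for the well-estimated terms (e.g. species.4, species.5, species.6 above) remain usable even though the separated terms do not.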