GLMM for overdispersed data

Question

I am analyzing data from 3 field experiments (farms=3) on a citrus flower disease. The response variable is binomial because each flower can only be diseased or healthy.

I am particularly interested in comparing 5 fungicide spraying systems (trt=5). I am not interested in the effect of any specific farm; the farms simply represent the farms of the region for which I want to recommend the best treatments.

Each farm had 4 blocks (bk=4), with 2 trees as subsamples (tree=2), and I assessed 100 flowers on each tree.

This is a quick look at the data:

dinc <- within(dinc, { tree_id <- as.factor(interaction(farm, trt, bk, tree)) })

farm      trt      bk   tree   dis   tot  tree_id
<fctr>   <fctr>  <fctr> <fctr><int> <int> <fctr>
iaras   Calendar   1      1     0   100  iaras.Calendar.1.1
iaras   Calendar   1      2     1   100  iaras.Calendar.1.2
iaras   Calendar   2      1     1   100  iaras.Calendar.2.1
iaras   Calendar   2      2     3   100  iaras.Calendar.2.2

The model I considered was:

resp <- with(dinc, cbind(dis, tot-dis))

m1 = glmer(resp ~ trt + (1|farm/bk), family = binomial, data=dinc)

I tested for overdispersion with overdisp_fun() from the GLMM FAQ page:

        chisq         ratio             p          logp 
 4.191645e+02  3.742540e+00  4.804126e-37 -8.362617e+01

As the ratio (sum of squared Pearson residuals / residual df) is > 1 and the p-value is < 0.05, I considered adding an observation-level random effect to deal with the overdispersion.
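
For reference, here is a minimal sketch of what such a check computes, assuming the version of overdisp_fun() published in the GLMM FAQ (Pearson chi-square over residual degrees of freedom). The simulated data, the names `sim`, `fit` and `res`, and the beta-binomial parameters are illustrative assumptions, and a plain glm() is used only to keep the example free of extra packages:

```r
# Sketch of the overdispersion check, following the GLMM FAQ version of
# overdisp_fun(); it accepts glm() fits as well as glmer() fits.
overdisp_fun <- function(model) {
  rdf <- df.residual(model)                  # residual degrees of freedom
  rp <- residuals(model, type = "pearson")   # Pearson residuals
  pearson_chisq <- sum(rp^2)
  c(chisq = pearson_chisq,
    ratio = pearson_chisq / rdf,
    rdf = rdf,
    p = pchisq(pearson_chisq, df = rdf, lower.tail = FALSE))
}

# Hypothetical data: 120 trees, 100 flowers each, with tree-to-tree variation
# in disease probability (beta-binomial) that a plain binomial model ignores.
set.seed(1)
n_trees <- 120
p_tree <- rbeta(n_trees, shape1 = 1, shape2 = 9)
sim <- data.frame(tot = 100, dis = rbinom(n_trees, size = 100, prob = p_tree))

fit <- glm(cbind(dis, tot - dis) ~ 1, family = binomial, data = sim)
res <- overdisp_fun(fit)
print(res)  # a ratio well above 1 points to overdispersion
```

A ratio close to 1 with a large p-value would suggest that the binomial variance is adequate for the fitted model.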

So I added a random effect for each row (tree_id) to the model, but I am not sure how to include it. This is my approach:

m2 = glmer(resp ~ trt + (1|farm/bk) + (1|tree_id), family = binomial, data=dinc)
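
As a sketch of the idea behind this term, the snippet below builds an identifier with one level per row and checks that property before fitting. The simulated data frame `sim`, the treatment and farm labels, and the object name `m_olre` are hypothetical; the model is fitted only when lme4 is installed:

```r
# Hypothetical layout mirroring the question: 3 farms x 5 treatments x
# 4 blocks x 2 trees, 100 flowers per tree.
set.seed(2)
sim <- expand.grid(tree = factor(1:2), bk = factor(1:4),
                   trt = factor(paste0("T", 1:5)),
                   farm = factor(c("A", "B", "C")))
sim$tot <- 100
sim$dis <- rbinom(nrow(sim), size = sim$tot,
                  prob = rbeta(nrow(sim), 1, 9))

# One level per row: this is what makes the term "observation-level".
sim$tree_id <- interaction(sim$farm, sim$trt, sim$bk, sim$tree, drop = TRUE)
stopifnot(nlevels(sim$tree_id) == nrow(sim))

# Fit only if lme4 is available. The formula matches m2, with the response
# written inline so that it is looked up in `data`.
if (requireNamespace("lme4", quietly = TRUE)) {
  m_olre <- lme4::glmer(
    cbind(dis, tot - dis) ~ trt + (1 | farm/bk) + (1 | tree_id),
    family = binomial, data = sim)
  print(lme4::VarCorr(m_olre))  # the tree_id variance absorbs extra-binomial variation
}
```

Because the simulated data contain no real farm or block effect, lme4 may report a singular fit here; that is a property of the toy data, not of the approach.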

I also wonder if farm should be a fixed effect, since it has only 3 levels...

m3 = glmer(resp ~ trt * farm + (1|farm:bk) + (1|tree_id), family = binomial, data=dinc)

I would really appreciate your suggestions about my model specification.

What is the source of your over-dispersion? Are you simply having a zero-inflated model? Or looking at a rare event? Can you please produce some plots of your data as well as the output of your model? — usεr11852, Apr 23 '17 at 00:23
Thank you for this additional information. Why is the response between $[0,1]$? I thought you only had 0 (ie. healthy) and 1 (ie. infected). Is `resp` the proportion of infected flowers out of the 100 you checked on each tree? (Just to be clear your `m1` and `m2` seem perfectly reasonable at first glance) — usεr11852, Apr 23 '17 at 12:21
Yes, `resp` is the `diseased | total-diseased` flowers assessed at each tree. Only for the plot I used diseased/total. — Juanchi, Apr 24 '17 at 18:06
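
The two ways of writing the response discussed in these comments (a two-column matrix of diseased and healthy counts, or a proportion with the totals as weights) describe the same binomial model. A small sketch on hypothetical simulated data, using plain glm() so that it runs without extra packages:

```r
# Hypothetical data: 5 treatments, 8 trees each, 100 flowers per tree.
set.seed(3)
sim <- data.frame(trt = factor(rep(paste0("T", 1:5), each = 8)), tot = 100)
sim$dis <- rbinom(nrow(sim), size = sim$tot, prob = 0.05)
sim$prop <- sim$dis / sim$tot

# Form 1: two-column response (diseased, healthy).
fit_counts <- glm(cbind(dis, tot - dis) ~ trt, family = binomial, data = sim)

# Form 2: proportion as the response, number of flowers as prior weights.
fit_prop <- glm(prop ~ trt, family = binomial, weights = tot, data = sim)

# The coefficient estimates agree up to numerical tolerance.
same_fit <- isTRUE(all.equal(coef(fit_counts), coef(fit_prop), tolerance = 1e-6))
print(same_fit)
```

Fitting the proportion without the weights would discard the information that each value is based on 100 flowers.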

Reid · Accepted Answer · 2017-04-25T18:03:17.703

1

Overdispersion Problem

It looks like you're modeling a count variable as a binomial and I think that's the source of your overdispersion.

You could model everything as a binomial distribution, but the total for each observation is exactly the same. ~~Plus, the count of diseased plants never reaches the maximum of 100, so it's not really censored the way a binomial would be.~~

EDIT: So, you could easily report this as a "rate" of disease over the total sample. In this way you could analyze the 'count' of disease or proportion (disease / total) as a negative binomial model.

EDIT2: Because there seems to be some hesitance to use a negative binomial, here is a list of recent phytopathology articles (same discipline as OP) that model disease as a negative binomial (Prager et al., 2014, Mori et al., 2008, Passey et al., 2017, Paiva de Almeida et al., 2016)

A histogram of your y variable looks like a zero inflated negative binomial.

Note the long right tail that you typically see with a negative binomial or Poisson.

There are a few different ways to handle this, but here's an easy solution:

library(lme4)  # provides glmer.nb()
m4 <- glmer.nb(dis ~ trt + (1 | farm/bk), data = dinc)

summary(m4)
overdisp_fun(m4)

I got the following overdispersion results:

      chisq       ratio         rdf           p 
122.1655582   1.0811111 113.0000000   0.2617332

Looks good, right?

(EDIT: Ignore strikethrough portion below)

~~Side Issue: Your Trees are Independent Observations~~

At first, it looks like each of the two trees should be a random effect.

However, Tree 1 on farm 1 is not comparable to Tree 1 in farm 2. Therefore, you don't want to model the effect of Tree as a random effect. Imagine if each Tree was a different person. Adding a random effect for each person wouldn't matter unless you had multiple observations per person.

Similarly, including the farm "block" doesn't really have an effect on the model.

Alternative Models and Final Thoughts

You could check out a zero-inflated negative binomial model,
although your dispersion doesn't seem bad with the standard negative binomial.
The MASS package offers an alternative way to run a negative binomial model.
Additionally, you could run this as a quasi-Poisson model.
I'll include some code below, in case you want to pursue this.

 library(MASS)  # provides glmmPQL() and negative.binomial()

 m5<-glmmPQL(dis ~ trt ,
             random = ~ 1 | farm/bk,
             family = negative.binomial(theta=9.86), 
             data = dinc)

 summary(m5)

 m6<-glmmPQL(dis ~ trt ,
             random = ~ 1 | farm/bk,
             family = quasipoisson(link='log'), 
             data = dinc)

 summary(m6)

Best of luck with your model!

EDIT In case you'd like to run this as a "rate", please try this code:

dinc$dis_prob<- dinc$dis / dinc$tot 

m7<-glmmPQL(dis_prob ~ trt ,
             random = ~ 1 | farm/bk,
             family = quasipoisson(link='log'), 
             data = dinc)

summary(m7)

edited Apr 25 '17 at 18:03

answered Apr 25 '17 at 00:51

Reid


There's a lot of good stuff in this answer, but also some confusing/wrong stuff. Just because the counts never approach 100 doesn't mean the data "aren't censored the way a binomial should be." And the bit about "side issue: your trees are independent observations" misses the point. The OP is not estimating a random effect of non-comparable tree 1 vs. tree 2 across farms. The OP is estimating a random effect with separate levels for every individual tree in the dataset, which is a common and useful way to deal with overdispersion. And lastly, the farm block DOES affect the model. – Jacob Socolar Apr 25 '17 at 04:48
Reid, thanks for your answer... however, I think the variable is the proportion diseased/total, and that's the way I want to report the results and plots (maybe with lsmeans contrasts). That's the way everybody in the phytopathological community handles this type of assessment. – Juanchi Apr 25 '17 at 11:38
@user43849 - Regarding the 'censored' comment, I just meant that there's no a priori reason that this needs to be a binomial model. One potential reason could be a distribution of responses clustering around 0 and 1. I'll edit to clarify. Regarding the Tree problem: if the OP is using a unique identifier, it was not included in the dataset. The one I downloaded included only values of 1 and 2, alternating. The farm block might theoretically affect the model, but it did not change any values in my testing. All values remained constant in my negative binomial model. – Reid Apr 25 '17 at 16:54
@Juanchi Running your model as a negative binomial won't really alter your conclusions! Your interpretation will be regarding the "rate" of disease count per 100. If you'd like, you can divide disease / count and directly produce your "rate" variable, then model using a quasipoisson or nb model. I guarantee you that you won't be the first phytopathologist to analyze a count or rate using a negative binomial. – Reid Apr 25 '17 at 17:01
@user43849 Regarding the tree_id, I'm realizing now the provided dataset differed from the head provided in the original question. I think that was what caused the discrepancy. Still, I suspect that a unique ID for a tree with only one observation will not have much of an effect on the model. However, I cannot comment without exploring whether there are multiple observations per tree. – Reid Apr 25 '17 at 17:06
@Reid you're right, I edited the link to the data. Each tree has 100 observations (flowers) that take the value 0 (healthy) or 1 (diseased). I will test your models. Just to know: does `glmmPQL` accept quasi-binomial? – Juanchi Apr 25 '17 at 18:12
I will keep this last model... I also tried it with quasi-binomial (m8), and plot(m8) seems to look better than plot(m7). How do you suggest comparing the goodness of fit of these models? – Juanchi Apr 25 '17 at 18:41
@Juanchi model selection is a little complicated with a PQL model, so check out the discussion [here.](https://stats.stackexchange.com/questions/185491/diagnostics-for-generalized-linear-mixed-models-specifically-residuals?rq=1) – Reid Apr 25 '17 at 20:06
What do you think, @Juanchi ? Did this solve your overdispersion problem? Also, note citable sources for overdispersion included in the original answer. – Reid Apr 26 '17 at 00:30

score 0 · Answer 2 · answered Apr 24 '17 at 16:06

You are correct that you have modeled it appropriately. Each of your flowers is "nested" under a tree, so they are not independent of each other. Your code is appropriate in that you have allowed the intercept to vary by tree.

It also looks like you have examined the intraclass correlation (i.e. the overdisp_fun() that you used).

Further, since farm has three levels, it is appropriate to just treat it as fixed (especially if you don't really care about the difference). In this case, you test the inclusion of the fixed levels, and if they do not improve the fit, you can discard them.

Make sure that you are examining the AIC and BIC to help with model construction.
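
A rough sketch of such a comparison, on hypothetical simulated data and with plain glm() fits so that it runs without extra packages. The names `fit_trt`, `fit_farm` and `ic` are illustrative; to my knowledge the same AIC(), BIC() and anova() calls also accept glmer() fits:

```r
# Hypothetical data with the same layout as the question and no real
# farm effect: 3 farms x 5 treatments x 4 blocks x 2 trees.
set.seed(4)
sim <- expand.grid(tree = 1:2, bk = factor(1:4),
                   trt = factor(paste0("T", 1:5)),
                   farm = factor(c("A", "B", "C")))
sim$tot <- 100
sim$dis <- rbinom(nrow(sim), size = sim$tot, prob = 0.05)

# Nested candidate models: without and with farm as a fixed effect.
fit_trt <- glm(cbind(dis, tot - dis) ~ trt, family = binomial, data = sim)
fit_farm <- glm(cbind(dis, tot - dis) ~ trt + farm, family = binomial, data = sim)

# Lower AIC/BIC is better; BIC penalizes the extra farm parameters more.
ic <- data.frame(model = c("trt", "trt + farm"),
                 AIC = AIC(fit_trt, fit_farm)$AIC,
                 BIC = BIC(fit_trt, fit_farm)$BIC)
print(ic)

# Likelihood-ratio test for the added farm terms.
print(anova(fit_trt, fit_farm, test = "Chisq"))
```

Information criteria are only comparable between models fitted to the same response and the same rows of data.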
