Inverting data for Zero-inflated mixed effects models

Question

I am looking for some advice on my analyses. I have been going back and forth with co-authors about the validity of my approach and would appreciate some external input.

My data are derived from multi-country household surveys of rural farmers living under different land-use configurations. We are focusing on frequency of food consumption to examine dietary patterns. To do so, we asked: "On how many of the last 7 days did you consume a particular food group?" (7-day recall). I am fitting mixed-effects models with the glmmTMB package in R to account for the nested nature of the experimental design. I am happy with this approach, but I am being questioned on my logic for using zero-inflated models. I explain below and would appreciate any comments, clarifications, suggested alternatives, or even a "you can't do that".

There are two main issues I am having with my coauthors.

1. The "inverting" of the data in order to fit a zero-inflated model.
2. The assumptions I am making about there being "real" and "structural" zeros, with two different processes determining them.

So my data looks like this for meat and vegetable consumption.

There are obviously many respondents eating both food groups 7 days a week, which produces a spike at 7 in the distribution of counts, especially for vegetables.

My approach was to "invert" the data so instead it would represent a count of the number of days a respondent went without consuming that food group.

In my mind this makes the data more manageable from a modelling perspective and was my starting point to examining which models would be appropriate.

My first question is: Is it OK to do this from a modelling and statistics point of view?

From here I ran a number of models and compared their fits; the best-fitting were either zero-inflated Poisson or zero-inflated negative binomial.

The biggest pushback I have been getting (apart from the inverting, which I think is fine) is that my coauthors do not agree that there could be two sources of zeros (that is, two sources of 7-day consumption in the original data, before I inverted it).

I have suggested that some respondents we surveyed would always answer 7 days (structural zeros), while others have a variable weekly intake but just happened to eat the food group on all 7 days that week (real zeros).

I assume that if people have the opportunity to eat these food groups daily, then they do. These people are farmers and have access to these food groups because they grow them themselves, can readily collect them from the forest, or are wealthy enough to purchase or trade for them (all of these factors are controlled for in the model). It therefore seems reasonable to me to assume that some of the people who answered '7 days' belong to a group that would only ever answer '7 days' (structural zeros), and that zero-inflated models are therefore appropriate. Remember that because I inverted the data, a structural zero means belonging to the group who could only ever report zero days without consuming that food group. So one set of processes determines the odds of someone only (or close enough to only) being able to answer 7 days, and another set of processes determines the frequency of consumption for everyone else.
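
For reference, the two-process idea described here corresponds to the generic zero-inflated mixture below, written for the inverted outcome $Y_i$ (days without consumption). This is the standard textbook form, not output from the fitted models; the symbols $\pi_i$ (probability of being an "always 7 days" respondent), $f$ (the Poisson or negative binomial mass function) and $\mu_i$ (its mean) are notation assumed for illustration.

$$
P(Y_i = 0) = \pi_i + (1 - \pi_i)\, f(0 \mid \mu_i), \qquad
P(Y_i = k) = (1 - \pi_i)\, f(k \mid \mu_i), \quad k = 1, 2, \ldots
$$

The first term in $P(Y_i = 0)$ is the structural-zero group; the second covers respondents whose intake varies but who happened to report 7 days that week.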

My second question is: Is this approach appropriate and does it justify the use of zero inflated models?

I would really appreciate people to point out any flaws in my approach or my logic and suggest alternatives or point me in the direction of alternative approaches. I have looked into quite a few different approaches and I always end up back here.

It should be noted that the model outputs make perfect sense in terms of the significant factors driving each process, for both real and structural zeros (I would be worried if they were throwing up weird results).

Many thanks in advance!

Model the outcome and the reversed outcome with the count model, ignoring the zero-inflation. You will find that changing the direction affects the estimated coefficients: not only do the signs change, the magnitudes change too. The answer below from @DimitrisRizopoulos is a reasonable approach that will not be affected by which value you choose as the success. Because it is binomial, reversing should only flip the sign. glmmTMB permits a cbind(success, failure) outcome, which is what you'll need to implement it. See its manual. – Heteroskedastic Jim, Nov 17 '18 at 05:04
This would be a beta-binomial model, right? If so, how would I deal with the 1's and 0's? I have looked into the `zoib` package in R, but it cannot handle nested random effects. – Josh Van Vianen, Nov 20 '18 at 03:46
Not a beta-binomial, a regular multilevel binomial as suggested in the single answer you have. I was only pointing out that using a model with a log link and reversing the outcome has consequential effects on the results you find, so the choice to reverse should not be arbitrary. But you can arbitrarily reverse if using a binomial with a logit link. – Heteroskedastic Jim, Nov 20 '18 at 03:58
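
The point made in these comments about link functions can be checked with a small numerical sketch. It is only an illustration under strong simplifying assumptions: the data are hypothetical (not the survey data), there are no random effects and no zero-inflation, and the only covariate is a binary group indicator, in which case the fitted means of a Poisson (log link) or binomial (logit link) model equal the group means and the slope has a closed form.

```python
import math

N_DAYS = 7  # length of the recall window

# Hypothetical days-consumed counts for two groups (NOT the survey data).
group_a = [7, 7, 6, 7, 5, 7, 4, 7]
group_b = [3, 5, 7, 2, 4, 6, 1, 7]


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def invert(days):
    """Turn days with consumption into days without consumption."""
    return [N_DAYS - d for d in days]


def slopes(a, b):
    """Coefficient for group b versus group a under two link functions.

    With an intercept and one binary covariate, the maximum-likelihood
    fitted means equal the group means, so each slope is a difference of
    link-transformed group means.
    """
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    log_slope = math.log(mean_b) - math.log(mean_a)  # Poisson, log link
    logit_slope = logit(mean_b / N_DAYS) - logit(mean_a / N_DAYS)  # binomial
    return log_slope, logit_slope


log_orig, logit_orig = slopes(group_a, group_b)
log_inv, logit_inv = slopes(invert(group_a), invert(group_b))

print(f"log link:   original {log_orig:+.3f}, inverted {log_inv:+.3f}")
print(f"logit link: original {logit_orig:+.3f}, inverted {logit_inv:+.3f}")
```

Under the logit link the inverted slope is the exact negative of the original, while under the log link both the sign and the magnitude change, which is why the direction of the outcome matters for Poisson-type models but not for the binomial.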

Dimitris Rizopoulos · Answer 1 · 2018-11-17T04:57:49.690


The negative binomial model is typically used for count data with no upper limit, i.e., the outcome variable $Y$ can take values in the set $\{0, 1, 2, \ldots\}$. However, in your case the count cannot exceed 7.

A more appropriate distribution for your data is the binomial distribution, which describes the number of “successes” (in your case, a success is eating the food group, meat or vegetables, on a given day) in $N$ trials (in your case, $N = 7$ is the number of days).
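
Written out, this suggestion amounts to a model along the following lines. The notation is assumed here for illustration: $Y_{ij}$ is the number of days respondent $i$ in cluster $j$ ate the food group, $x_{ij}$ are the covariates, and $b_j$ is a Gaussian random effect; the actual nesting in the survey may require more than one level of random effects.

$$
Y_{ij} \mid b_j \sim \text{Binomial}(7,\, p_{ij}), \qquad
\log\frac{p_{ij}}{1 - p_{ij}} = x_{ij}^\top \beta + b_j, \qquad
b_j \sim N(0, \sigma_b^2)
$$

Because $\text{logit}(1 - p) = -\text{logit}(p)$, modelling the days without consumption, $7 - Y_{ij}$, only flips the signs of $\beta$ and $b_j$ and leaves the fit unchanged.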

P.S. Note that in both negative binomial and binomial mixed models you need to be careful with the interpretation of the estimated fixed-effects coefficients; for more details on this, check this post.

edited Nov 17 '18 at 04:57

answered Nov 17 '18 at 04:48

Dimitris Rizopoulos


Thanks for the great advice. I was wondering, though: does each "trial" need to be independent? I think this might be a problem here, as it's likely that eating meat on one day influences the probability of eating meat on other days during the week. Someone who ate meat on 6 days, for example, will have a higher probability of eating meat on the 7th day. Is this something I need to worry about? If so, how do you propose I deal with it? – Josh Van Vianen Nov 19 '18 at 11:19
Note that you assume that the trials are independent *given* the random effects. – Dimitris Rizopoulos Nov 19 '18 at 11:29
You can deal with the non-independence/non-homogeneity of the trials (@DimitrisRizopoulos: I think the OP means independence of trials within observations ...) by any of the standard methods for incorporating *overdispersion*, i.e. an observation-level random effect or a Beta-binomial model or a quasibinomial estimate ... – Ben Bolker Nov 20 '18 at 19:12

@BenBolker my point was that by including the Gaussian random effects to model the correlations over time you also account for overdispersion. It could be that this is not sufficient, in which case you could indeed do something extra. – Dimitris Rizopoulos Nov 21 '18 at 05:34
Sorry for the slow uptake on this. @BenBolker I was indeed worried about the independence of trials within observations, i.e. the independence of successive 'failures' or 'successes' if I think of 7 days as 7 'trials' and run a binomial model. I am familiar with observation-level random effects, which I will run and look into, but not with beta-binomials in `glmmTMB`. Is it appropriate to compare AIC scores between the two? – Josh Van Vianen Jan 03 '19 at 13:15
@BenBolker if I do go for a beta regression implemented in `glmmTMB`, do I need to worry about the 0's and 1's? Or do you recommend a transform to get my [0,1] distribution of proportions to (0,1)? – Josh Van Vianen Jan 03 '19 at 14:02
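
On the beta-binomial option raised in the comments above, a short sketch may help show what that kind of overdispersion does to a 7-day count. It compares a binomial and a beta-binomial with the same mean. The shape parameters `A` and `B` are hypothetical values chosen for illustration, not estimates from the survey, and the sketch ignores covariates and random effects.

```python
import math

N_DAYS = 7  # number of "trials" in the 7-day recall


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def log_beta(a, b):
    """Logarithm of the Beta function, computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """Beta-binomial probability: binomial with p drawn from Beta(a, b)."""
    log_pmf = (
        math.log(math.comb(n, k))
        + log_beta(k + a, n - k + b)
        - log_beta(a, b)
    )
    return math.exp(log_pmf)


# Hypothetical shape parameters; the implied mean probability is A / (A + B).
A, B = 2.0, 0.5
MEAN_P = A / (A + B)

binom = [binom_pmf(k, N_DAYS, MEAN_P) for k in range(N_DAYS + 1)]
betabinom = [betabinom_pmf(k, N_DAYS, A, B) for k in range(N_DAYS + 1)]

for k in range(N_DAYS + 1):
    print(f"{k} days: binomial {binom[k]:.3f}   beta-binomial {betabinom[k]:.3f}")
```

With these assumed values both distributions have the same mean, but the beta-binomial puts noticeably more mass on 7 days (and on 0 days), so a pile-up at the boundary can sometimes be absorbed by overdispersion alone. glmmTMB provides a `betabinomial` family; its documentation describes the exact parameterisation.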
