Which GLM(M) to use for proportional data?

Question

I have proportional data that can take any value between 0 and 1. However, there is certainly an over-abundance of 0's, and the rest of the values (there are few) tend to be close to zero.

Given this, what is the appropriate GLM to use?

Thanks in advance.

Do you know the denominators, i.e. how many 'trials' for each observation? — Ben Bolker, Jan 18 '22 at 23:27
Yes, I believe so. To be more specific, the rate that I'm looking at is: # of a certain type of hatches / # of pupae. This made me think of a Poisson regression model with an offset, but I've also heard that it's possible to use proportional data in a binomial logistic regression. What do you think? Thanks for your help! — taylor weishaar, Jan 19 '22 at 00:24

score 1 · Accepted Answer · answered Jan 19 '22 at 01:14

If you know proportions and their denominators (in your case, "number of a certain type of hatches per number of pupae"), then a binomial response is (in my opinion) the most principled/sensible thing to do.
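As a rough illustration of what "a binomial response" means here, the sketch below fits logit(p) = b0 + b1 * x to counts with known denominators using plain Newton-Raphson. The data (`hatches`, `pupae`, and a binary covariate `x`) are made up for the example, and the function name is hypothetical; in practice you would use your statistics package's binomial GLM rather than hand-rolled code.

```python
import math

def fit_binomial_logit(x, successes, trials, n_iter=25):
    """Fit logit(p_i) = b0 + b1 * x_i to binomial counts by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate the score vector (g0, g1) and the information matrix H.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, successes, trials):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = yi - ni * p            # observed minus expected successes
            w = ni * p * (1.0 - p)         # binomial variance of the count
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system H * delta = g and take the Newton step.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: hatches of one type out of pupae, in two treatment groups.
x = [0, 0, 0, 1, 1, 1]
hatches = [0, 0, 1, 0, 2, 1]
pupae = [40, 35, 50, 45, 60, 38]

b0, b1 = fit_binomial_logit(x, hatches, pupae)
p_group0 = 1.0 / (1.0 + math.exp(-b0))          # fitted proportion when x = 0
p_group1 = 1.0 / (1.0 + math.exp(-(b0 + b1)))   # fitted proportion when x = 1
print(p_group0, p_group1)
```

With a single binary covariate the fitted proportions simply reproduce the pooled proportion in each group; the point is that each observation is weighted by its denominator, which a model on raw proportions would ignore.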

Lots of zeros are expected when the mean proportion is low; it's still possible that you need a zero-inflated binomial model, but unlikely (Warton 2005).
When the mean of a binomial is low, a Poisson model with an offset gives nearly identical results (see here). More precisely, the probability should be low everywhere: if a few combinations of covariates lead to higher probabilities, that could mess things up.
As always, you should check for overdispersion after fitting the model and, if necessary, do something appropriate (quasi-likelihood, observation-level random effects, beta-binomial model, ...).
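The points above can be checked numerically. The sketch below uses made-up numbers (40 pupae, a hatch probability of 0.01, and the same kind of hypothetical counts as a real data set might have) to show that a plain binomial with a low mean already produces mostly zeros, that a Poisson with mean n * p is very close to it, and how a simple Pearson-based dispersion estimate is computed. The function names are hypothetical, and the intercept-only fit is only there to keep the example short.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def pois_pmf(k, mu):
    """Poisson probability of count k with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def pearson_dispersion(successes, trials, fitted_p, n_params):
    """Pearson chi-square divided by residual degrees of freedom."""
    chi2 = sum(
        (y - n * p) ** 2 / (n * p * (1.0 - p))
        for y, n, p in zip(successes, trials, fitted_p)
    )
    return chi2 / (len(successes) - n_params)

# Assumed values for illustration only.
n, p = 40, 0.01

# Probability of observing zero hatches, with no zero inflation at all.
p_zero_binom = binom_pmf(0, n, p)
p_zero_pois = pois_pmf(0, n * p)

# Largest pointwise gap between the two distributions over all counts.
max_gap = max(abs(binom_pmf(k, n, p) - pois_pmf(k, n * p)) for k in range(n + 1))
print(p_zero_binom, p_zero_pois, max_gap)

# Dispersion check for an intercept-only binomial fit to hypothetical counts.
hatches = [0, 0, 1, 0, 2, 1]
pupae = [40, 35, 50, 45, 60, 38]
p_hat = sum(hatches) / sum(pupae)
dispersion = pearson_dispersion(hatches, pupae, [p_hat] * len(pupae), n_params=1)
print(dispersion)
```

Here roughly two thirds of observations are expected to be zero under either distribution. A dispersion estimate well above 1 would point towards one of the remedies listed above, although with expected counts this small the Pearson statistic is only a rough guide.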

Warton, David I. “Many Zeros Do Not Mean Zero Inflation: Comparing the Goodness-of-Fit of Parametric Models to Multivariate Abundance Data.” Environmetrics 16 (2005): 275–89. https://doi.org/10.1002/env.702.

+1. To clarify something, if we use a logit we are primarily looking at a fractional logit model. — usεr11852, Jan 19 '22 at 03:00

score 0 · Answer 2 · answered Jan 19 '22 at 03:22

Complementary to Ben's answer. (+1)

You might also want to consider beta regression. It lets you model the response explicitly as a proportion, it can be very flexible regarding the shape of the response distribution, and (with a logit link) the coefficients have the usual interpretation on the log-odds scale. A counter-argument is that beta regression is primarily geared towards continuous proportions and does not handle exact 0s and 1s directly; but if it makes sense to consider a hurdle/zero-inflated model, it can be a reasonable candidate (see the zoib package if you are working with R).
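To make the zero-inflated idea concrete, here is a minimal sketch of a zero-inflated beta log-likelihood, assuming the mean-precision parameterization (shape parameters mu * phi and (1 - mu) * phi) and a constant zero probability `pi0`. The function names and parameter values are hypothetical, and this is not the implementation used by any particular package.

```python
import math

def beta_logpdf(y, mu, phi):
    """Log density of a beta distribution with mean mu and precision phi."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)
    )

def zib_loglik(ys, pi0, mu, phi):
    """Log-likelihood of a zero-inflated beta model for proportions in [0, 1)."""
    total = 0.0
    for y in ys:
        if y == 0.0:
            # Exact zeros come only from the point mass at zero.
            total += math.log(pi0)
        elif 0.0 < y < 1.0:
            # Non-zero values come from the beta part.
            total += math.log(1.0 - pi0) + beta_logpdf(y, mu, phi)
        else:
            raise ValueError("this sketch only handles values in [0, 1)")
    return total

# Hypothetical proportions: mostly zeros, the rest close to zero.
ys = [0.0, 0.0, 0.0, 0.02, 0.0, 0.033, 0.0, 0.026]
ll = zib_loglik(ys, pi0=0.6, mu=0.03, phi=50.0)

# Sanity check: the beta part integrates to about 1 (midpoint rule).
steps = 2000
integral = sum(
    math.exp(beta_logpdf((i + 0.5) / steps, 0.3, 10.0)) for i in range(steps)
) / steps
print(ll, integral)
```

Note that the zeros and the non-zero values are modelled by separate pieces, so this structure only makes sense if there is a reason to think the zeros arise from a different process.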

CV.SE has some great threads on this already:

The R package betareg has a very informative vignette too.

(Using a Poisson GLMM with an offset would be my first pick too but I add this for a more complete view of the subject.)

This is not wrong or crazy, but honestly I don't think it's a good idea. Zero-inflated beta-binomial is often a sort of "code smell" - you really have to think about why the zeros are there. Do they represent a detection threshold (in which case you might want to use a censoring model)? Do they represent the first part of a two-step process (in which case a standard hurdle-type model makes sense)? Is it some kind of compound distribution (in which case you might want a Tweedie)? If you really have integer denominators available, then using a binomial is so much more natural ... — Ben Bolker, Jan 22 '22 at 00:10
Credit where it's due, that is a good comment. (+1) Yes, I agree. As I wrote, I would go with your point two ("a Poisson model with an offset") as my first choice; I wouldn't go with a beta-binomial from the get-go either. — usεr11852, Jan 22 '22 at 01:35
