I am stuck in my thesis trying to fit a GLM to cetacean data (count, density, effort and covariates). More than 94% of the observations in my dataset are zeros; does that mean I should use a zero-inflated negative binomial (ZINB) or zero-inflated Poisson (ZIP) GLM? Do I still need an offset in the model equation? In other journal papers on cetaceans I have never read about ZINB or ZIP. Are they not very common?

Thanks for your help.

  • Just because there are many zeros does not necessarily mean you need a zero-inflated model; a Poisson or negative binomial regression may be adequate. I would recommend you follow [the model selection approach gung proposes in the proposed duplicate](https://stats.stackexchange.com/a/78977/1352). – Stephan Kolassa Jun 30 '21 at 10:05
  • 0. Welcome to CV.SE. 1. You are correct to ask whether a zero-inflated/hurdle model might be relevant here. Please see my answer below for more details. – usεr11852 Jun 30 '21 at 22:16
  • But it looks like this Q has a better answer than the proposed dup ... – kjetil b halvorsen Jul 01 '21 at 22:45

1 Answer

Indeed, 94% zeros sounds like a rather large proportion, so your original idea of using a zero-inflated or a hurdle model is not unfounded. Any reasonable analyst would consider it. :)

That said, as Stephan mentioned in his comment, a large proportion of zeros does not by itself necessitate a zero-inflated or a hurdle model. I still think it is very likely that you will need a ZI count model (94% seems very large without any context). I would suggest looking at some formal references, for example: Hilbe's Modeling Count Data, Chapt. 7 "Problems with Zeros", is very nice and accessible; it mentions a number of approaches (e.g. boundary likelihood ratio tests, Vuong tests, etc.). Zuur et al.'s Mixed Effects Models and Extensions in Ecology with R, Chapt. 11 "Zero-Truncated and Zero-Inflated Models for Count Data", is also considered quite a standard reference.
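To make the point about zeros concrete, here is a minimal sketch (not taken from either reference) of how many zeros a plain Poisson or negative binomial already implies at a given mean. The means and the dispersion value `theta=0.5` are made-up illustration values, not estimates from any cetacean dataset.

```python
import math

def poisson_zero_prob(mu):
    """P(Y = 0) for a Poisson count with mean mu."""
    return math.exp(-mu)

def negbin_zero_prob(mu, theta):
    """P(Y = 0) for a negative binomial with mean mu and size theta,
    i.e. variance mu + mu**2 / theta."""
    return (theta / (theta + mu)) ** theta

# Hypothetical mean counts per survey segment (made-up values).
for mu in (0.06, 0.5, 2.0):
    p_pois = poisson_zero_prob(mu)
    p_nb = negbin_zero_prob(mu, theta=0.5)
    print(f"mean={mu:4.2f}  Poisson P(0)={p_pois:.3f}  NB(theta=0.5) P(0)={p_nb:.3f}")
```

With a mean of 0.06 animals per segment, a Poisson model implies about 94% zeros on its own, so the share of zeros has to be judged against the fitted mean rather than in isolation.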

Regarding the use of an offset: an offset is mostly relevant if it makes sense to view the response variable as part of a rate rather than as raw counts (e.g. number of infected individuals per 100K). Without knowing your exact research question one cannot answer this definitively; the interpretation of an offset has been covered a couple of times on this forum.
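As a rough illustration of what an offset does, the sketch below assumes a log link with `log(effort)` as the offset, so the model describes a rate per unit of effort. The counts and effort values are invented, and the helper names are hypothetical; the intercept-only case is used because its maximum-likelihood rate has a closed form (total count divided by total effort).

```python
import math

def expected_count(effort, linear_predictor):
    """Mean count under a log link with offset log(effort):
    log(mu) = log(effort) + eta  =>  mu = effort * exp(eta)."""
    return effort * math.exp(linear_predictor)

def intercept_only_rate(counts, efforts):
    """Maximum-likelihood rate for an intercept-only Poisson model with
    offset log(effort): total count divided by total effort."""
    return sum(counts) / sum(efforts)

# Hypothetical sightings per transect and km surveyed (made-up values).
counts = [0, 0, 3, 0, 1]
efforts = [10.0, 5.0, 20.0, 8.0, 7.0]

rate = intercept_only_rate(counts, efforts)   # animals per km
eta = math.log(rate)                          # fitted intercept on the log scale
for y, e in zip(counts, efforts):
    print(f"effort={e:5.1f} km  observed={y}  expected={expected_count(e, eta):.2f}")
```

The expected count scales with effort while the rate stays fixed, which is the sense in which the offset turns a count model into a rate model.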

In any case, it would be good to consider using rootograms to visualise your results. Kleiber & Zeileis (2014), Visualizing Count Data Regressions Using Rootograms, is a good reference on the matter (a free version is also available).
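For orientation, here is a minimal sketch of the numbers behind a hanging rootogram, assuming an intercept-only Poisson fit (the mean is simply the sample mean). The counts are made up and `hanging_rootogram` is a hypothetical helper, not a function from any package; a real analysis would use the fitted frequencies of the actual regression model.

```python
import math
from collections import Counter

def poisson_pmf(k, mu):
    """Poisson probability of observing exactly k events."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def hanging_rootogram(counts):
    """Rows of (k, observed freq, expected freq, bar bottom).
    Each bar hangs from sqrt(expected) and has height sqrt(observed),
    so its bottom sits at sqrt(expected) - sqrt(observed)."""
    n = len(counts)
    mu = sum(counts) / n            # intercept-only Poisson fit
    observed = Counter(counts)
    rows = []
    for k in range(max(counts) + 1):
        exp_freq = n * poisson_pmf(k, mu)
        obs_freq = observed.get(k, 0)
        bottom = math.sqrt(exp_freq) - math.sqrt(obs_freq)
        rows.append((k, obs_freq, exp_freq, bottom))
    return rows

# Hypothetical data: 94 zeros and a few positive counts (made-up values).
counts = [0] * 94 + [1, 1, 2, 3, 5, 8]
for k, obs, exp_freq, bottom in hanging_rootogram(counts):
    print(f"k={k}  observed={obs:3d}  expected={exp_freq:7.2f}  bar bottom={bottom:6.2f}")
```

A bar bottom below zero means the observed frequency exceeds the expected one at that count; a bottom above zero means the model expects more observations at that count than were seen.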

A final comment about the perceived lack of ZI model use in "journal papers regarding cetaceans": it might be the case that in the papers you have seen ZI/hurdle models were unnecessary, could not be estimated correctly, or were simply ignored; we don't know that. Do not hesitate to use a "more sophisticated" model; ultimately it is a matter of whether such a model (hierarchical, spatial, zero-inflated, what have you) is relevant to your research question and whether it can be estimated correctly.

usεr11852
  • Wow, thank you for the long and in-depth answer. I will follow your advice and get back to you if things remain unclear. – Leonie Lepple Jul 03 '21 at 15:45
  • I am glad I could help. If this answer is helpful please consider upvoting it and if it resolves your question marking it as the accepted answer. – usεr11852 Jul 03 '21 at 18:07
  • I realised that my GLM model is underdispersed: `M4 – Leonie Lepple Jul 05 '21 at 17:02
  • This is very likely OK. You can try a hurdle model as a simple solution, or a generalised Poisson. There are some even more specialised models (e.g. [Conway-Maxwell-Poisson](https://en.wikipedia.org/wiki/Conway%E2%80%93Maxwell%E2%80%93Poisson_distribution-)), but for what you describe they seem like overkill. It is also important to examine why you think under-dispersion occurs (if indeed it occurs). – usεr11852 Jul 05 '21 at 19:39