33

I'm currently trying to apply a linear model (family = gaussian) to an indicator of biodiversity that cannot take values lower than zero, is zero-inflated and is continuous. Values range from 0 to a little over 0.25. As a consequence, there is quite an obvious pattern in the residuals of the model that I haven't managed to get rid of:

[image: residual plot]

Does anyone have any ideas on how to solve this?

amoeba
David
  • If it's zero-inflated it cannot be continuous, since continuous variables cannot have any jumps in the cdf (and there's clearly one at 0). It may be continuous aside from the 0's. – Glen_b Dec 21 '15 at 22:54
  • Related: https://stats.stackexchange.com/questions/105320 – amoeba Nov 12 '17 at 00:40

2 Answers

55

There are a variety of solutions to the case of zero-inflated (semi-)continuous distributions:

  • Tobit regression: assumes that the data come from a single underlying Normal distribution, but that negative values are censored and stacked on zero (e.g. censReg package). Here is a good book about the Tobit model; see chapters 1 and 5.
  • see this answer for other censored-Gaussian alternatives
  • hurdle or "two-stage" model: use a binomial model to predict whether the values are 0 or >0, then use a linear model (or Gamma, or truncated Normal, or log-Normal) to model the observed non-zero values. Typically you need to roll your own by running two separate models; combined versions that fit the zero component and the non-zero component at the same time exist for count distributions such as Poisson (e.g. glmmTMB, pscl), and glmmTMB will also fit 'zero-inflated'/hurdle models for Beta or Gamma responses.
  • Tweedie distributions: distributions in the exponential family that, for a power (shape) parameter in the range $1<p<2$, have a point mass at zero and a skewed positive distribution for $x>0$ (e.g. tweedie, cplm, glmmTMB packages)
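
As a rough illustration of the hurdle / two-part idea in the list above, here is a minimal sketch in plain Python (chosen only for illustration; the packages named above are R packages). It assumes a single categorical predictor, in which case the binomial part reduces to the per-group proportion of non-zero values and a Gaussian linear model on the positives reduces to the per-group mean. The function name `fit_two_part` and the toy data are hypothetical, not taken from the question.

```python
from statistics import mean


def fit_two_part(y, group):
    """Two-part (hurdle) fit with one categorical predictor.

    Part 1: P(y > 0) per group (what a saturated binomial model would fit).
    Part 2: mean of the non-zero values per group (what a Gaussian linear
            model on the positives would fit).
    """
    fits = {}
    for g in sorted(set(group)):
        ys = [v for v, gg in zip(y, group) if gg == g]
        positives = [v for v in ys if v > 0]
        p_positive = len(positives) / len(ys)
        # Undefined when a group has no positive values; 0.0 is a placeholder.
        mean_positive = mean(positives) if positives else 0.0
        fits[g] = {
            "p_positive": p_positive,
            "mean_positive": mean_positive,
            # Unconditional mean: E[y] = P(y > 0) * E[y | y > 0]
            "expected": p_positive * mean_positive,
        }
    return fits


# Hypothetical zero-inflated responses in [0, 0.25] for two groups.
y = [0.0, 0.0, 0.10, 0.20, 0.0, 0.05, 0.15, 0.25]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
fits = fit_two_part(y, group)
for g, fit in fits.items():
    print(g, fit)
```

With continuous predictors the two parts would instead be a logistic regression and a regression on the positive values, fitted separately as described above.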

Or, if your data structure is simple enough, you could just use linear models and use permutation tests or some other robust approach to make sure that your inference isn't being messed up by the interesting distribution of the data.
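
A minimal sketch of the permutation-test idea, in plain Python for illustration: fit an ordinary least-squares slope, then compare it with the slopes obtained after shuffling the response. It assumes a single predictor and exchangeable observations under the null hypothesis; the helper names and the toy data are hypothetical.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (model with an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the slope.

    Shuffling y breaks any association with x, so the shuffled slopes
    approximate the null distribution without assuming normal residuals.
    """
    rng = random.Random(seed)
    observed = abs(ols_slope(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(ols_slope(x, y_perm)) >= observed - 1e-12:
            hits += 1
    # Add-one correction: the observed arrangement counts as one permutation.
    return (hits + 1) / (n_perm + 1)


# Hypothetical zero-inflated response that tends to increase with x.
x = list(range(12))
y = [0, 0, 0, 0, 0.02, 0, 0.05, 0.08, 0, 0.12, 0.18, 0.25]
observed_slope = ols_slope(x, y)
p_value = permutation_p_value(x, y)
print(observed_slope, p_value)
```

With several predictors or grouped data, a plain shuffle of the response is no longer appropriate, which is why this only applies when the data structure is simple.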

There are R packages/solutions available for most of these cases.

There are other questions on SE about zero-inflated (semi)continuous data (e.g. here, here, and here), but they don't seem to offer a clear general answer ...

See also Min & Agresti, 2002, Modeling Nonnegative Data with Clumping at Zero: A Survey for an overview.

Ben Bolker
  • @Ben Bolker Would you "use a linear model (or Gamma, or truncated Normal, or log-Normal) to model the" predicted or actual non-zero values? – rolando2 Jan 14 '17 at 15:05
  • The packages `gamlss` and `gamlss.inf` provide the function `gamlssZadj`, which fits a two-part model for any distribution defined on the positive real line. It fits a logit model for the zeros and a gamlss model for the positive part of the data simultaneously. – COOLSerdash Oct 03 '21 at 20:57
1

You can also use Poisson Pseudo-Maximum Likelihood (PPML). It was first developed by Santos Silva and Tenreyro (2006) for applications to international trade between countries. In 2011, the same authors extended the analysis of PPML's performance (see here). They also have this page with some material about the model. It has since been used in many other applications. In my field, it has been used in energy economics, policy and regulation (for instance, Zhao et al. (2013), De Groote et al. (2016), Gautier and Jacqmin (2020)).

In Stata you can use the ppmlhdfe command; its implementation is here.
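
For readers who want to see what the estimator does, here is a minimal sketch in plain Python (for illustration only; it is not the ppmlhdfe implementation). It assumes a single predictor and the mean model E[y|x] = exp(b0 + b1*x), and solves the Poisson score equations by Newton's method. The response only needs to be non-negative; it does not have to be a count. The function name `fit_ppml` and the toy data are hypothetical, and the sketch omits the robust (sandwich) standard errors that would be needed for inference.

```python
import math


def fit_ppml(x, y, n_iter=50, tol=1e-10):
    """Solve the Poisson score equations for E[y|x] = exp(b0 + b1*x)."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: X'(y - mu)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix: X' diag(mu) X
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: (X' diag(mu) X)^{-1} X'(y - mu)
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical zero-inflated, non-negative response; x rescaled to [0, 1].
x = [i / 11 for i in range(12)]
y = [0, 0, 0, 0, 0.02, 0, 0.05, 0.08, 0, 0.12, 0.18, 0.25]
b0, b1 = fit_ppml(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print(b0, b1)
```

Zeros need no special handling here because the response is never log-transformed; only the conditional mean is modelled on the log scale.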

morebru