I am trying to predict a non-negative, continuous outcome using a generalized linear model in R (the glm function), and I am wondering which family I could use for the training data. Some of the families take the log of the outcome variable (if I understand this correctly), and the log of zero is -Inf. Note that almost 80% of the outcomes are zero.

Here is what the outcome variable looks like:

[Figure: histogram of the outcome variable]

1 Answer

I would not use any of the distributions available in GLM if I had an outcome variable like this. You need to widen your arsenal of methods. Consider, e.g., a hurdle model or a zero-inflated model (see this thread for some good discussion). Although the zero-inflated normal is less common than zero-inflated count models, it does exist.

For a review of methods that have been tried, see Min and Agresti (2002), Modeling nonnegative data with clumping at zero. Methods they discuss include:

  1. Tobit regression
  2. Two part models
  3. Sample selection models
  4. Compound Poisson exponential dispersion models
  5. Ordinal threshold models
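To make the first item concrete, here is a minimal Tobit sketch in R. It assumes the `survival` package is available and uses simulated data; the variable names (`latent`, `x`, `y`) and the true coefficients are made up for illustration. Tobit treats the zeros as left-censored values of a latent normal variable, which may or may not be a sensible story for your data.

```r
library(survival)

set.seed(2)
n <- 1000
x <- rnorm(n)

# Hypothetical latent outcome; observed y is the latent value censored at zero
latent <- -1 + 1 * x + rnorm(n)
y <- pmax(latent, 0)

# Tobit regression: Gaussian model with left-censoring at 0.
# The event indicator is TRUE when y is observed exactly (y > 0)
# and FALSE when it is censored at zero.
fit_tobit <- survreg(Surv(y, y > 0, type = "left") ~ x, dist = "gaussian")

summary(fit_tobit)
```

The coefficients describe the latent variable, not the observed outcome, so predictions of the observed mean need an extra step.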

A more recent paper is Eggers (2015), On statistical methods for zero-inflated models. She covers:

  1. Tobit models
  2. Sample selection models
  3. Double hurdle models
  4. Two part models
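
As a rough illustration of a two-part model, the sketch below uses only base R: a logistic regression for whether the outcome is positive, and a Gamma GLM with a log link for the size of the positive values. The data are simulated, and the names (`dat`, `pos`, `fit_zero`, `fit_pos`) and the choice of a Gamma family for the positive part are assumptions for the example, not something prescribed by the papers above.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)

# Simulated outcome: roughly 80% zeros, Gamma-distributed when positive
is_pos <- rbinom(n, 1, plogis(-1.5 + 0.8 * x))
mu     <- exp(1 + 0.5 * x)
y      <- ifelse(is_pos == 1, rgamma(n, shape = 2, rate = 2 / mu), 0)
dat    <- data.frame(y = y, x = x, pos = as.numeric(y > 0))

# Part 1: probability that the outcome is positive
fit_zero <- glm(pos ~ x, family = binomial(link = "logit"), data = dat)

# Part 2: mean of the outcome given that it is positive (zeros excluded,
# so the log link never sees a zero response)
fit_pos <- glm(y ~ x, family = Gamma(link = "log"), data = subset(dat, y > 0))

# Combined prediction: E[y] = P(y > 0) * E[y | y > 0]
p_pos    <- predict(fit_zero, newdata = dat, type = "response")
mu_pos   <- predict(fit_pos,  newdata = dat, type = "response")
dat$pred <- p_pos * mu_pos
```

Because the two parts are fitted separately, each can use its own set of predictors if the process producing zeros differs from the one driving the positive amounts.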
Peter Flom