Very skewed and zero-inflated continuous outcome variable

Question

I am trying to predict a non-negative, continuous outcome using a generalized linear model in R (the glm function), and I am wondering which family I could use for the training data. Some of the families take the log of the outcome variable (if I understand this correctly), and the log of zero is -Inf. Note that almost 80% of the outcomes are zero.

Here is what the outcome variable looks like

score 2 · Answer 1 · answered Dec 03 '18 at 11:47

I would not use any of the distributions available in GLM if I had an outcome variable like this. You need to widen your arsenal of methods. Consider e.g. a hurdle model or a zero-inflated model (see this thread for some good discussion). Although the zero-inflated normal is less common than zero-inflated count models, it does exist.

For a review of methods that have been tried, see Min and Agresti (2002), Modeling nonnegative data with clumping at zero. Methods they discuss include:

Tobit regression
Two part models
Sample selection models
Compound Poisson exponential dispersion models
Ordinal threshold models
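
As a rough illustration of the first item, a Tobit regression can be fitted in R as a left-censored Gaussian model with survreg from the survival package. This is only a sketch on simulated data: the variable names (x, y, latent) and the coefficients are made up for the example. Tobit assumes a latent normal variable censored at zero, which may fit a very skewed outcome poorly.

```r
library(survival)

# Simulated data (hypothetical): a latent normal variable censored at zero,
# which leaves roughly three quarters of the outcomes equal to zero
set.seed(2)
n <- 1000
x <- rnorm(n)
latent <- -1 + 1.0 * x + rnorm(n)
y <- pmax(latent, 0)
dat <- data.frame(y = y, x = x)

# Tobit regression: values at zero are treated as left-censored observations
fit_tobit <- survreg(Surv(y, y > 0, type = "left") ~ x,
                     data = dat, dist = "gaussian")

# Coefficients are on the scale of the latent variable, not the observed outcome
coef(fit_tobit)
fit_tobit$scale
```

The coefficients describe the latent variable, so predictions of the observed outcome need an extra step that accounts for the censoring.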

A more recent paper is Eggers (2015), On statistical methods for zero-inflated models. She covers:

Tobit models
Sample selection models
Double hurdle models
Two part models
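
Of the approaches listed, a two-part model is probably the easiest to sketch in base R: one model for whether the outcome is positive, and a second for its size when it is. The example below is a minimal sketch on simulated data. The variable names, the coefficients, and the choice of a Gamma family with log link for the positive part are assumptions for illustration, not a recommendation from the papers above.

```r
# Simulated data (hypothetical): roughly 80% zeros, skewed positive values otherwise
set.seed(1)
n <- 2000
x <- rnorm(n)
is_pos <- rbinom(n, 1, plogis(-1.5 + 0.8 * x))
y <- ifelse(is_pos == 1,
            rgamma(n, shape = 2, rate = 2 / exp(0.5 + 0.3 * x)),
            0)
dat <- data.frame(y = y, x = x, pos = as.integer(y > 0))

# Part 1: probability that the outcome is positive (logistic regression)
fit_zero <- glm(pos ~ x, family = binomial(link = "logit"), data = dat)

# Part 2: size of the outcome, fitted on the positive rows only,
# so the log link never sees a zero
fit_pos <- glm(y ~ x, family = Gamma(link = "log"), data = subset(dat, y > 0))

# Combined prediction: E[y | x] = P(y > 0 | x) * E[y | y > 0, x]
p_pos  <- predict(fit_zero, newdata = dat, type = "response")
mu_pos <- predict(fit_pos,  newdata = dat, type = "response")
dat$pred <- p_pos * mu_pos
```

Because the second model is fitted only where y > 0, the log-of-zero problem from the question does not arise. The zeros are handled entirely by the first model.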
