Zero-inflated two-part models for semi-continuous data

Question

I am trying to study predictors of companies' pollution output for some specific chemicals. The data I am using have many zeros (i.e., the company did not pollute at all with those chemicals) and are otherwise continuous with a long right tail. I have seen others model such data by adding 1 to the dependent variable and then taking the log. My sense is that this is wrong, but I don't understand why. Could someone explain? This approach is much simpler than what I think I should be doing, which is using zero-inflated two-part models for semi-continuous data, so I'd be thrilled if it turned out that simply adding 1 and logging is right.

Second, I have found a Stata ado file to run zero-inflated two-part models for semi-continuous data. Is there a way to incorporate fixed effects into this type of model?

dimitriy · Answer 1 · 2014-02-20T19:00:02.040

Disadvantages of $\ln(y+c)$:
- $c=1$ is arbitrary. The value of $c$ often changes the estimates, so you need to conduct a grid search for the "optimal" result and then justify that choice.
- The zero mass may respond differently to covariates (the extensive and intensive margins may have different DGPs).
- The retransformation problem (getting back to the natural scale) is worse at the low end if you want to predict $y$.
- It sometimes works poorly. See Duan, N., W.G. Manning, et al. "A Comparison of Alternative Models for the Demand for Medical Care," Journal of Business & Economic Statistics, 1:115-126, 1983 for some examples (available via JSTOR, gated, and as a RAND working paper).
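
To make the first point concrete, here is a minimal sketch of how the slope in a regression of $\ln(y+c)$ on a predictor can move with $c$. The data are made up for illustration (they are not from the question or from Duan et al.), and `ols_slope` is a hypothetical helper, not a library function.

```python
import math

# Hypothetical toy data (made up): x is a predictor, y is a semi-continuous
# outcome with exact zeros and a long right tail.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0.0, 0.0, 0.0, 1.5, 0.0, 4.0, 12.0, 40.0]


def ols_slope(xs, zs):
    """Slope from a simple least-squares regression of zs on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    z_bar = sum(zs) / n
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(xs, zs))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    return sxz / sxx


# Refit the same regression for several arbitrary shift constants c.
slopes = {
    c: ols_slope(x, [math.log(yi + c) for yi in y])
    for c in (0.01, 0.1, 1.0)
}

for c, b in slopes.items():
    print(f"c = {c:<5} slope of ln(y + c) on x = {b:.3f}")
```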

There's no panel version of `tpm`. I would try using dummies and clustering on the panel id if computationally possible. I might also give `xtpoisson, fe robust` or `xtpqml` (a user-written wrapper) a whirl, justifying it as Quasi-MLE, which has performed well in CS simulations even when the number of zeros is large.
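
The Quasi-MLE idea can be sketched outside Stata as well. The following is a rough illustration, not a reimplementation of `xtpoisson` or `xtpqml`: it fits $E[y \mid x] = \exp(b_0 + b_1 x)$ by Newton's method on the Poisson quasi-likelihood, with zeros entering untransformed and $y$ not required to be a count. The data and the function name `poisson_qmle` are made up for illustration, and the sketch leaves out fixed effects and robust or clustered standard errors.

```python
import math

# Hypothetical toy data (made up): a semi-continuous outcome with exact zeros.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0.0, 0.0, 0.0, 1.5, 0.0, 4.0, 12.0, 40.0]


def poisson_qmle(xs, ys, max_iter=100, tol=1e-10):
    """Fit E[y | x] = exp(b0 + b1 * x) by Newton's method on the Poisson
    quasi-likelihood. y need not be an integer count."""
    # Start at the intercept-only fit.
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in xs]
        # Score (gradient of the quasi-log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(ys, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(xs, ys, mu))
        # Information matrix (negative Hessian), 2 x 2 and symmetric.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(xs, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(xs, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve H * delta = g by Cramer's rule.
        d0 = (g0 * h11 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


b0, b1 = poisson_qmle(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

Because the model includes an intercept, the fitted means sum to the observed total at convergence, which is a quick sanity check on the fit. No constant $c$ has to be chosen at any point.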

[`xtnbreg`](http://www.stata.com/help.cgi?xtnbreg) is good to note as well. I've estimated good fitting models with up to 65% zero observations using the negative binomial. — Andy W, Feb 20 '14 at 18:50

score 1 · Accepted Answer · edited Feb 20 '14 at 18:35

Not sure about Stata, but R can run zero-inflated models with fixed effects. Check out, for example, the `gamlss` package and `zeroinfl()` from the `pscl` package.

answered Feb 20 '14 at 17:23 by a11msp · edited Feb 20 '14 at 18:35 by chl
