I am trying to study predictors of companies' pollution output for some specific chemicals. The data have many zeros (i.e., the company emitted none of those chemicals) and are otherwise continuous with a long right tail. I have seen others model such data by adding 1 to the dependent variable and then taking logs. My sense is that this is wrong, but I don't understand why. Could someone explain? This approach is much simpler than what I think I should be doing - using zero-inflated two-part models for semi-continuous data - so I'd be thrilled if it turned out that simply adding 1 and logging is right.

Second, I have found a Stata ado file to run zero-inflated two-part models for semi-continuous data. Is there a way to incorporate fixed effects into this type of model?

dimitriy
user40622

2 Answers

  1. Disadvantages of $\ln(0+c)$:

    • $c=1$ is arbitrary. Estimates often change with the value of $c$, so you need to run a grid search for the "optimal" value and then justify that choice.
    • The zero mass may respond differently to covariates (the extensive and intensive margins may have different DGPs).
    • The retransformation problem (getting back to the natural scale) is worse at the low end if you want to predict $y$.
    • It sometimes works poorly. See Duan, N., W.G. Manning, et al., “A Comparison of Alternative Models for the Demand for Medical Care,” Journal of Business & Economic Statistics, 1:115-126, 1983, for some examples (available gated on JSTOR and as a RAND working paper).
  2. There's no panel version of `tpm`. I would try including the fixed effects as dummies and clustering on the panel id, if that is computationally feasible. I might also give `xtpoisson, fe robust` or `xtpqml` (a user-written wrapper) a whirl, justifying it as quasi-MLE, which has performed well in CS simulations even when the number of zeros is large.
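
To make the point about the arbitrariness of $c$ concrete, here is a minimal Python sketch on simulated data. Everything in it is an illustrative assumption rather than something taken from the pollution data in the question: the data-generating process, the coefficient values, and the helper names `simulate` and `ols_slope` are all made up for the example. It regresses $\ln(y+c)$ on a single covariate for several values of $c$ and prints the slope each time.

```python
import math
import random


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate(n=5000, seed=0):
    """Hypothetical semi-continuous outcome: a point mass at zero plus a
    log-normal positive part. All parameter values are assumptions."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)
        # Extensive margin: probability of any positive output rises with x.
        p_positive = 1.0 / (1.0 + math.exp(-0.5 * xi))
        if rng.random() < p_positive:
            # Intensive margin: log(y) is linear in x with slope 1.0.
            yi = math.exp(1.0 + 1.0 * xi + rng.gauss(0.0, 1.0))
        else:
            yi = 0.0
        x.append(xi)
        y.append(yi)
    return x, y


x, y = simulate()

# Re-estimate the same regression for several choices of the constant c.
slopes = {}
for c in (0.01, 0.1, 1.0, 10.0):
    log_y = [math.log(yi + c) for yi in y]
    slopes[c] = ols_slope(x, log_y)
    print(f"c = {c:>5}: slope on x = {slopes[c]:.3f}")
```

In this simulation the slope should shrink noticeably as $c$ grows, because $c$ decides how far the zeros sit from the positive values on the log scale. That is why picking $c$ amounts to a modeling decision that needs justifying, not a harmless convenience.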

dimitriy
  • [`xtnbreg`](http://www.stata.com/help.cgi?xtnbreg) is worth noting as well. I've estimated well-fitting models with up to 65% zero observations using the negative binomial. – Andy W Feb 20 '14 at 18:50

Not sure about Stata, but R can run zero-inflated models with fixed effects. Check out, for example, the `gamlss` package and `zeroinfl()` from the `pscl` package.

chl
a11msp