I'm working with a log-normal regression model, but some values of the dependent variable equal zero (they are not missing). Can I use an alternative specification such as $\log(y+1)$ ~ $X$ (most $y$'s are really large)? Or should I just omit those observations? And do I have to run a balance check every time I drop observations and report all the results in an appendix?
-
Some zeros are not a problem with generalized linear models and a logarithmic link. These work on the assumption that the mean is positive and do not use a direct logarithmic transformation of the response. – Nick Cox Nov 15 '14 at 12:19
-
Unless those observations are clearly erroneous (as determined by independent information), removing them would be a mistake. The results would be biased. – whuber Nov 15 '14 at 14:17
-
Very similar questions have extensive discussions at http://stats.stackexchange.com/questions/30728 and http://stats.stackexchange.com/questions/41361. – whuber Nov 15 '14 at 21:11
1 Answer
The answer to this question really depends on the meaning of $y=0$ in your application. You've said most $y$ are really large -- this means that those $y$'s that are zero are quite different from the other $y$'s. Are they really likely to be generated by the same process that generates the large $y$'s?
If $y=0$ is truly comparable with $y\gg0$ then you could use the approach you suggest of taking $\log(y+1)$. This would equate to a functional form of
$$ y_i = e^{\beta_0} e^{\beta_1 x_{i,1}} \cdots e^{\beta_n x_{i,n}} e^{\epsilon_i} - 1 $$
Note that under this transformation the errors $\epsilon_i$ enter multiplicatively (through $e^{\epsilon_i}$), not additively.
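As a concrete illustration, here is a minimal sketch of the $\log(y+1)$ approach in Python, assuming numpy and statsmodels are available; the simulated data and variable names are illustrative, not taken from the question:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
# Simulated response: mostly large positive values, with a few exact zeros.
y = np.exp(2.0 + 0.5 * x + rng.normal(scale=0.3, size=n))
y[rng.random(n) < 0.05] = 0.0

# Fit OLS on log(y + 1); np.log1p computes log(1 + y) accurately.
X = sm.add_constant(x)
fit = sm.OLS(np.log1p(y), X).fit()
print(fit.summary())

# Back-transform of the fitted values: y_hat = exp(Xb) - 1 (note the minus one).
y_hat = np.expm1(fit.predict(X))
```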
If most $y$'s are really large, then $\log(y+1)$ will also be much larger when $y\neq0$ than when $y=0$, so including the observations for which $y=0$ will increase the error in your regression significantly. The distribution of the error terms will be non-normal (and may effectively be a mixture of two different distributions), and the coefficient estimates will potentially be biased if the process generating the $y=0$ observations differs from that generating the $y>0$ observations.
You could ignore the cases where $y=0$ entirely if you consider the aim of your model to be understanding the relationship between the $y$'s and the $x$'s conditional on $y>0$.
Alternatively, and perhaps more satisfactorily if $y=0$ has meaning in your context, you could explicitly model the probability of $y=0$ vs. $y>0$ using a binary model (i.e. a logistic regression) and then have a sub-model that only considers the cases where $y>0$. This avoids omitting any data points, at the cost of having to consider which explanatory variables are suitable for explaining the probability of $y=0$.
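A hedged sketch of this two-part ("hurdle") idea in Python follows; the simulated data are again illustrative, and the half-variance retransformation correction in the last step is an added assumption (it follows from log-normal errors) rather than part of the answer above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = np.exp(2.0 + 0.5 * x + rng.normal(scale=0.3, size=n))
y[rng.random(n) < 0.05] = 0.0  # a handful of exact zeros

X = sm.add_constant(x)
pos = y > 0

# Part 1: logistic regression for Pr(y > 0).
logit_fit = sm.Logit(pos.astype(float), X).fit(disp=0)

# Part 2: log-linear model on the strictly positive observations only.
ols_fit = sm.OLS(np.log(y[pos]), X[pos]).fit()

# Combined prediction: E[y] = Pr(y > 0) * E[y | y > 0].
# exp(Xb) alone understates E[y | y > 0] under log-normal errors;
# adding half the residual variance is one standard correction.
p_pos = logit_fit.predict(X)
e_y_given_pos = np.exp(ols_fit.predict(X) + 0.5 * ols_fit.scale)
y_hat = p_pos * e_y_given_pos
```

Note that Part 2 on its own corresponds to the "condition on $y>0$" option in the previous paragraph.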
Rob Hyndman's blog has some further discussion on transforming data with zeroes in it.

-
Thanks for your answer. I think I will just omit those observations. Actually, the $y=0$ cases are quite unusual; maybe there are some mistakes in the records. I also appreciate your solution of running two regressions. This will definitely be a useful method in some other cases. – Brian Nov 15 '14 at 06:22
-
Your characterization of leverage is erroneous. You also need to consider the effects of the $\log(y+1)$ transformation on the error terms, which have been neglected. – whuber Nov 15 '14 at 14:21
-
@whuber I've removed the erroneous leverage point and added the error terms to the function. – Will Scott Nov 16 '14 at 19:30