I'm working with a log-normal regression model, but some values of the dependent variable equal zero (they are not missing). Can I use an alternative specification such as $\log(y+1)$ ~ $X$ (most $y$'s are really large)? Or should I just omit those observations? And do I have to run a balance check every time I drop observations and report all the results in an appendix?
-
Some zeros are not a problem with generalized linear models and a logarithmic link. These work on the assumption that the mean is positive and do not use a direct logarithmic transformation of the response. – Nick Cox Nov 15 '14 at 12:19
-
Unless those observations are clearly erroneous (as determined by independent information), removing them would be a mistake. The results would be biased. – whuber Nov 15 '14 at 14:17
-
Very similar questions have extensive discussions at http://stats.stackexchange.com/questions/30728 and http://stats.stackexchange.com/questions/41361. – whuber Nov 15 '14 at 21:11
1 Answer
The answer to this question really depends on the meaning of $y=0$ in your application. You've said most $y$ are really large -- this means that those $y$'s that are zero are quite different from the other $y$'s. Are they really likely to be generated by the same process that generates the large $y$'s?
If $y=0$ is truly comparable with $y\gg0$ then you could use the approach you suggest of taking $\log(y+1)$. This would equate to a functional form of
$$ y_i = e^{\beta_0} e^{\beta_1 x_{i,1}} \cdots e^{\beta_n x_{i,n}} e^{\epsilon_i} - 1 $$
Note that under this transformation the errors $\epsilon_i$ enter multiplicatively (through $e^{\epsilon_i}$), not additively.
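As a concrete illustration, here is a minimal sketch of the $\log(y+1)$ approach in Python, assuming numpy and statsmodels are available; the simulated data and variable names are illustrative, not taken from the question:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
# Simulated response: mostly large positive values, with a few exact zeros.
y = np.exp(2.0 + 0.5 * x + rng.normal(scale=0.3, size=n))
y[rng.random(n) < 0.05] = 0.0

# Fit OLS on log(y + 1); np.log1p computes log(1 + y) accurately.
X = sm.add_constant(x)
fit = sm.OLS(np.log1p(y), X).fit()
print(fit.summary())

# Back-transform of the fitted values: y_hat = exp(Xb) - 1 (note the minus one).
y_hat = np.expm1(fit.predict(X))
```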
If most $y$'s are really large, then $\log(y+1)$ will also be much larger when $y\neq0$ than when $y=0$, so including the observations for which $y=0$ will increase the error in your regression significantly. The distribution of the error terms will be non-normal (and may effectively be a mixture of two different distributions), and the coefficient estimates will potentially be biased if the process generating the $y=0$ observations differs from that generating the $y>0$ observations.
You could ignore the cases where $y=0$ entirely if you consider the aim of your model to be understanding the relationship between the $y$'s and the $x$'s conditional on $y>0$.
Alternatively, and perhaps more satisfactorily if $y=0$ has meaning in your context, you could explicitly model the probability of $y=0$ vs. $y>0$ using a binary model (i.e. a logistic regression) and then have a sub-model that only considers the cases where $y>0$. This avoids omitting any data points, at the cost of having to consider which explanatory variables are suitable for explaining the probability of $y=0$.
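A hedged sketch of this two-part ("hurdle") idea in Python follows; the simulated data are again illustrative, and the half-variance retransformation correction in the last step is an added assumption (it follows from log-normal errors) rather than part of the answer above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = np.exp(2.0 + 0.5 * x + rng.normal(scale=0.3, size=n))
y[rng.random(n) < 0.05] = 0.0  # a handful of exact zeros

X = sm.add_constant(x)
pos = y > 0

# Part 1: logistic regression for Pr(y > 0).
logit_fit = sm.Logit(pos.astype(float), X).fit(disp=0)

# Part 2: log-linear model on the strictly positive observations only.
ols_fit = sm.OLS(np.log(y[pos]), X[pos]).fit()

# Combined prediction: E[y] = Pr(y > 0) * E[y | y > 0].
# exp(Xb) alone understates E[y | y > 0] under log-normal errors;
# adding half the residual variance is one standard correction.
p_pos = logit_fit.predict(X)
e_y_given_pos = np.exp(ols_fit.predict(X) + 0.5 * ols_fit.scale)
y_hat = p_pos * e_y_given_pos
```

Note that Part 2 on its own corresponds to the "condition on $y>0$" option in the previous paragraph.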
Rob Hyndman's blog has some further discussion on transforming data with zeroes in it.

-
Thanks for your answer. I think I will just omit those observations. Actually, the $y=0$ cases are quite unusual; maybe there are some mistakes in the records. I also appreciate your solution of running two regressions. This will definitely be a useful method in some other cases. – Brian Nov 15 '14 at 06:22
-
Your characterization of leverage is erroneous. You also need to consider the effects of the $\log(y+1)$ transformation on the error terms, which have been neglected. – whuber Nov 15 '14 at 14:21
-
@whuber I've removed the erroneous leverage point and added the error terms to the function. – Will Scott Nov 16 '14 at 19:30