Alternative to log when doing linear regression on multiplicative dataset with zeros

Question

I'm working with a dataset that is approximately lognormally distributed but contains a large number of zeros. Think of forum post activity per subforum: among those who do post, the number of posts is roughly lognormal, but many users do not post at all in a given subforum. As one would expect with lognormal data, there is 'cone-shaped' heteroscedasticity, which becomes homoscedastic when the log of each variable is used (i.e. log(y) ~ log(x1) + log(x2) + ...).

However, the zeros are a problem. The dataset has 10000 entries and about 9 columns. After removing every row with at least one zero, the dataset shrinks to... 3 rows.

So my question is - how are situations like this commonly handled, where you need the squashing behaviour of the log function while also preserving the zeros?

I've looked at log(y+a), where a is some small positive constant, and at arcsinh(y), but each of these presents problems: the data often become bimodal, with all the zeros clustered together far from the rest of the data, and this gets worse as the constant a gets smaller. Both also feel, well, a bit arbitrary.

As an experiment, you can try replacing the values of zero with very small numbers - say, 1.0E-10 or so - and test if that gives a sufficiently accurate result, with the understanding that model predictions of very small numbers are meant to represent zero values. This technique can allow the regression to proceed so that analysis can determine if the results give acceptable tentative results. I have found times when this technique can aid in preliminary analysis. — James Phillips, Dec 19 '17 at 11:34
The problem with adding an extremely small number like 1.0E-10 is that it turns into a very large negative number once you take the log. This tends to become a pretty extreme outlier. — Maarten Buis, Dec 19 '17 at 12:41
As well as zero inflated distributions suggested by @Stephan Kolassa, integer values suggest a Poisson. If you need something more like lognormal consider a generalised Poisson. The Poisson-lognormal is awkward to work with, the negative binomial is easier to use. — user20637, Dec 19 '17 at 22:00
Maarten Buis has it right. I tried varying the constant from 1 to e^-40, and the result is that the R^2 value is artificially inflated as the zeros end up as large negative values, and the differences between the zeros and the rest of the data dominate the fit. — Ingolifs, Dec 19 '17 at 22:15
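
To make the point in the comments above concrete, here is a minimal sketch of how far the zeros land from the rest of the data after a log(y + a) transform. The reference value of 100 posts and the function name `zero_gap` are assumptions chosen for illustration, not values from the question.

```python
import math

def zero_gap(a, typical=100.0):
    """Distance on the log scale between a zero and a typical positive value
    after the transform log(y + a). `typical` is an illustrative assumption."""
    return math.log(typical + a) - math.log(0.0 + a)

# The gap keeps growing as the constant shrinks, so the zeros turn into
# an increasingly distant cluster that can dominate a least-squares fit.
for a in (1.0, 1e-5, 1e-10, math.exp(-40)):
    print(f"a = {a:.3g}: log(0 + a) = {math.log(a):8.2f}, gap = {zero_gap(a):8.2f}")
```

No choice of a removes the gap; it only trades a moderate one (a near 1) for an extreme one (a near 0).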

score 10 · Answer 1 · answered Dec 19 '17 at 08:55

It sounds to me like the data are fundamentally bimodal: a mixture of a degenerate distribution at y=0 and a lognormal distribution for the rest of the data. This isn't something that any simple transform will fix. Instead, I'd account for the hierarchical structure these data seem to have, with people first split into posters vs. non-posters, and then further (lognormal) variance within the posters category. So either separate out the posters and model only their data, or keep all the data and use a mixture model that includes a prediction of category (poster vs. non-poster) membership.
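
One way to write down the mixture described above is sketched here. The symbols are assumptions introduced for illustration: $\pi$ is the probability of being a poster, and $\mu$, $\sigma$ are the lognormal parameters among posters. Both $\pi$ and $\mu$ could depend on covariates; that dependence is omitted to keep the notation short.

$$
\begin{aligned}
P(y = 0) &= 1 - \pi, \\
\log y \mid y > 0 &\sim \mathcal{N}(\mu, \sigma^2), \\
\mathrm{E}(y) &= \pi \, \exp\!\left(\mu + \tfrac{\sigma^2}{2}\right).
\end{aligned}
$$

The last line follows from the lognormal mean $\exp(\mu + \sigma^2/2)$ among posters, weighted by the probability of posting at all.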

Also look at [hurdle models](https://stats.stackexchange.com/questions/81457/what-is-the-difference-between-zero-inflated-and-hurdle-models) which are distinct from zero-inflation and may better describe the mechanism that you were talking about. — alex keil, Dec 19 '17 at 18:04

score 10 · Accepted Answer · answered Dec 19 '17 at 08:58

The easiest way is to use the log link function, i.e. build a model for $\log(\mathrm{E}(y))$ rather than $\mathrm{E}(\log(y))$. That way you leave the observations unchanged, so you don't end up with missing values. Typically such models are implemented as Poisson regression. You can use quasi-likelihood (robust standard errors) to avoid making too strong assumptions. See this blog post, and the references therein and in the comments to that blog post.
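
The distinction matters because $\log(\mathrm{E}(y))$ is defined whenever the mean is positive, even if individual observations are zero. As a rough illustration, the sketch below fits a Poisson regression with a log link and a single predictor by Newton's method, using only the standard library. The function name `fit_poisson_log_link`, the starting values, and the toy data are assumptions made for this example; in practice one would use a GLM routine from a statistics package, and the robust standard errors mentioned above are not computed here.

```python
import math

def fit_poisson_log_link(x, y, iterations=50, tol=1e-10):
    """Fit E(y) = exp(b0 + b1 * x) by Newton's method on the Poisson
    log-likelihood. Zeros in y are used as they are; no log(y) is taken."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start at the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian (Fisher information), a 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Toy data (made up for illustration) with several zeros in the response.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 3, 2, 6, 9]
b0, b1 = fit_poisson_log_link(x, y)
print(f"log E(y) = {b0:.3f} + {b1:.3f} * x")
```

The response is never log-transformed, so no rows are dropped; only the mean is modelled on the log scale.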

score 1 · Answer 3 · andyyy

Some examples of the mixture distribution approach here:

https://www.r-bloggers.com/model-non-negative-numeric-outcomes-with-zeros/

One approach I've tried is to use left censored regression - replacing all the zeros by '< xsmall' (xsmall probably 1 in this case), and then using the log transform.
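
A minimal sketch of the likelihood behind this left-censored (Tobit-style) approach, written on the log scale: observed positive values contribute a normal density for log(y), while each zero contributes the probability that log(y) falls below the log of the censoring limit. The function name, the intercept-only mean `mu`, and the default limit of 1 are assumptions for illustration; a real analysis would let `mu` depend on covariates and maximize this function with an optimizer.

```python
import math

def censored_lognormal_loglik(y, mu, sigma, limit=1.0):
    """Log-likelihood of log(y) ~ Normal(mu, sigma^2) when every value
    below `limit` (here: the zeros) is only known to be '< limit'."""
    log_limit = math.log(limit)
    total = 0.0
    for yi in y:
        if yi < limit:
            # Censored observation: P(log y < log limit) via the normal CDF.
            z = (log_limit - mu) / sigma
            total += math.log(0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
        else:
            # Observed value: normal log-density of log(y).
            z = (math.log(yi) - mu) / sigma
            total += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return total

# Made-up counts: the zeros are treated as '< 1', never as log(0).
posts = [0, 0, 3, 12, 0, 45, 7]
print(censored_lognormal_loglik(posts, mu=1.0, sigma=2.0))
```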

edited Dec 19 '17 at 11:26

answered Dec 19 '17 at 11:20


score 1 · Answer 4 · answered Dec 19 '17 at 15:38

In marketing contexts, this is often handled by using two models:

In the first model, you are predicting whether the person will buy the product or not (post or not). This is a binomial situation.

In the second model, you are just using the purchasers (posters) and predicting how often they will post.
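
The two-model idea can be sketched roughly as follows, using only the standard library and a single predictor to keep it short. Model 1 is a logistic regression for whether the response is positive, and model 2 is an ordinary least-squares fit of log(y) on the positive rows only. All function names and the toy data are assumptions made for this example, and the plain Newton iteration will not converge if the zeros are perfectly separable by x.

```python
import math

def fit_logistic(x, z, iterations=50):
    """Model 1: P(z = 1) = 1 / (1 + exp(-(a0 + a1 * x))), by Newton's method."""
    a0 = a1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a0 + a1 * xi))) for xi in x]
        # Score vector and 2x2 information matrix of the log-likelihood.
        g0 = sum(zi - pi for zi, pi in zip(z, p))
        g1 = sum((zi - pi) * xi for xi, zi, pi in zip(x, z, p))
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for xi, wi in zip(x, w))
        h11 = sum(wi * xi * xi for xi, wi in zip(x, w))
        det = h00 * h11 - h01 * h01
        a0 += (h11 * g0 - h01 * g1) / det
        a1 += (h00 * g1 - h01 * g0) / det
    return a0, a1

def fit_log_ols(x, y):
    """Model 2: least-squares fit of log(y) = c0 + c1 * x; y must be positive."""
    n = len(x)
    ly = [math.log(yi) for yi in y]
    mx, mly = sum(x) / n, sum(ly) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (li - mly) for xi, li in zip(x, ly))
    c1 = sxy / sxx
    return mly - c1 * mx, c1

def fit_two_part(x, y):
    """Fit both models: the first on all rows, the second on positive rows."""
    is_positive = [1 if yi > 0 else 0 for yi in y]
    positive_rows = [(xi, yi) for xi, yi in zip(x, y) if yi > 0]
    xs, ys = zip(*positive_rows)
    return fit_logistic(x, is_positive), fit_log_ols(xs, ys)

# Toy data (made up): zeros mixed with positive counts.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 2, 0, 5, 9, 0, 30]
(a0, a1), (c0, c1) = fit_two_part(x, y)
print(f"P(y > 0):       logit = {a0:.3f} + {a1:.3f} * x")
print(f"log(y | y > 0):         {c0:.3f} + {c1:.3f} * x")
```

Because the zeros only enter the first model, the log transform in the second model is applied to strictly positive values and no constant has to be added.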
