How to do preprocessing for zero inflated variable in multiple regression?

Question

I am trying to build a multiple regression model, and many of my variables are zero-inflated, like this one (histogram of time spent in the system).

I have data like this because a zero actually represents a different business case: the customer created an account but never used it.

How should I use this type of variable in a regression model? I have some ideas for preprocessing; are they valid? What else can I do?

Idea 1: replace zeros with the median of the non-zero values.
Idea 2: create an indicator column for the zero values, then replace zeros with the median of the non-zero values.
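
To make Idea 2 concrete, here is a minimal sketch using NumPy. The function name `add_zero_indicator` and the toy `time_in_system` values are made up for illustration; they are not from my real data.

```python
import numpy as np

def add_zero_indicator(x):
    """Split a zero-inflated predictor into two columns (hypothetical helper).

    Returns
    -------
    is_zero  : 1 where x == 0, else 0
    x_filled : x with zeros replaced by the median of the non-zero values
    """
    x = np.asarray(x, dtype=float)
    is_zero = (x == 0).astype(int)
    nonzero_median = np.median(x[x != 0])
    x_filled = np.where(x == 0, nonzero_median, x)
    return is_zero, x_filled

# Toy data: three customers never used their account (time == 0).
time_in_system = np.array([0.0, 0.0, 2.0, 4.0, 9.0, 0.0])
is_zero, time_filled = add_zero_indicator(time_in_system)

# Both columns enter the design matrix, next to an intercept.
X = np.column_stack([np.ones_like(time_filled), is_zero, time_filled])
```

In a model with only these two columns, the fill value should affect just the indicator's coefficient: every zero row shares the same filled value, so the slope on `time_filled` is driven by the non-zero rows.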

Why are you imputing zero values with medians when there is clearly non-ignorable missingness? Is the goal to estimate the time they *would* have been in the system had they remained there? If you can replace the zero with a sensible censoring time, like last-login or record modification time, you might estimate a good probability model. — AdamO, Feb 14 '18 at 17:34
@AdamO thanks for the comment. I understand the problem with replacing them with the median, but not the second point. Can you tell me more about "replace the zero with a sensible censoring time, like last-login or record modification time"? — Haitao Du, Feb 14 '18 at 18:09
Idea 2 is correct. Idea 1 is not. On a similar note, in case you missed it, the thread "[Time spent in an activity as an independent variable](https://stats.stackexchange.com/questions/56306)" is closely related to your question. (To the point that I would assume they could potentially be merged.) — usεr11852, Feb 18 '18 at 21:31

rolando2 · Answer 1 · 2018-02-14T18:36:11.747

To take these zeroes and convert them to the median would create a false sense of certainty that would bias your results. I would assume that if these customers had nonzero values, the values would take on a range. That's one reason why multiple imputation has become so popular: it preserves some uncertainty, variability, in the imputed values while using all available information to assign them as plausibly as possible. So rather than proceeding as you suggest, I would use multiple imputation as one approach.
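
As a rough illustration of how multiple imputation keeps that variability, here is a minimal NumPy sketch, not a production implementation. It assumes the zeros in a predictor `x` are treated as missing, that there is one fully observed covariate `w`, and that a linear imputation model with normal noise is adequate; the function names and the simulated data are hypothetical. Each round bootstraps the observed rows, imputes from the fitted model plus random noise, refits the analysis model, and the results are pooled with Rubin's rules.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit; returns coefficients and their covariance matrix."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return beta, sigma2 * np.linalg.inv(X.T @ X)

def impute_and_pool(y, x, w, n_imputations=20, seed=0):
    """Treat zeros in x as missing, impute them repeatedly, pool the slope of x."""
    rng = np.random.default_rng(seed)
    missing = x == 0
    observed = np.flatnonzero(~missing)
    n_missing = int(missing.sum())
    estimates, variances = [], []
    for _ in range(n_imputations):
        # Bootstrap the observed rows so that uncertainty in the imputation
        # model's parameters carries over into the imputed values.
        boot = rng.choice(observed, size=len(observed), replace=True)
        Z_boot = np.column_stack([np.ones(len(boot)), w[boot], y[boot]])
        gamma, _ = ols(Z_boot, x[boot])
        resid_sd = np.std(x[boot] - Z_boot @ gamma, ddof=Z_boot.shape[1])
        # Impute as prediction plus noise, not a single fixed value.
        Z_mis = np.column_stack([np.ones(n_missing), w[missing], y[missing]])
        x_completed = x.copy()
        x_completed[missing] = Z_mis @ gamma + rng.normal(0.0, resid_sd, n_missing)
        # Analysis model: y ~ 1 + x + w on the completed data.
        X = np.column_stack([np.ones(len(y)), x_completed, w])
        beta, cov = ols(X, y)
        estimates.append(beta[1])
        variances.append(cov[1, 1])
    estimates, variances = np.array(estimates), np.array(variances)
    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    total_var = variances.mean() + (1 + 1 / n_imputations) * estimates.var(ddof=1)
    return estimates.mean(), np.sqrt(total_var)

# Simulated example (assumed data-generating process, true slope of x is 2).
rng = np.random.default_rng(1)
n = 200
w = rng.normal(size=n)
x_true = 5 + w + rng.normal(size=n)
y = 1 + 2 * x_true + 0.5 * w + rng.normal(size=n)
x = np.where(rng.random(n) < 0.3, 0.0, x_true)  # about 30% recorded as zero

pooled_slope, pooled_se = impute_and_pool(y, x, w)
```

The between-imputation term in the pooled variance is what a single median fill would drop. Note that this sketch zeroes out values completely at random; if the zeros arise for reasons related to the unobserved value itself, this simple imputation model would not be appropriate.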

A second approach would be censored regression, as described in the growing number of threads that you can find via a Google search or a search on this site. I found this page helpful. (Though it may be less helpful if all your zero-inflated variables are predictors.)

In either case I like your idea of including a binary column to indicate zero/nonzero.

EDIT: for some good introductions to multiple imputation, see the articles by Melissa Azur et al. (2011); John W. Graham (2009); and Jeffrey C. Wayman (2003).

+1. The link you provided is really helpful. I searched for zero inflated topics but did not see that page! — Haitao Du, Feb 14 '18 at 18:10
Can you also give me some resources for multiple imputation? — Haitao Du, Feb 14 '18 at 18:19
Can you edit your post to include links to the papers, or at least also the journals in which they appeared? Thank you! — Stephan Kolassa, Feb 14 '18 at 18:49
