
I am trying to build a multiple regression model, and many of my variables look like this (histogram of time spent in the system).

[Image: histogram of time spent in the system]

I have data like this because zero actually represents a separate business case: the customer created an account but never used it.

How should I use this type of variable in a regression model? I have some ideas for preprocessing. Are they valid? What else can we do?

  • Idea 1: replace the zeros with the median of the non-zero values.
  • Idea 2: create an indicator column for the zero values, then replace the zeros with the median of the non-zero values.
Haitao Du
  • Why are you imputing zero values with medians when there is clearly non-ignorable missingness? Is the goal to estimate the time they *would* have been in the system had they remained there? If you can replace the zero with a sensible censoring time, like last-login or record modification time, you might estimate a good probability model. – AdamO Feb 14 '18 at 17:34
  • @AdamO thanks for the comment. I understand the problem with replacing them with the median, but not the second point. Can you tell me more about "replace the zero with a sensible censoring time, like last-login or record modification time"? – Haitao Du Feb 14 '18 at 18:09
  • Idea 2 is correct. Idea 1 is not. On a somewhat similar note, in case you missed it the thread "[Time spent in an activity as an independent variable](https://stats.stackexchange.com/questions/56306)" is extremely related to your question. (To the point I would assume they could be potentially merged.) – usεr11852 Feb 18 '18 at 21:31

1 Answer


Converting these zeroes to the median would create a false sense of certainty that would bias your results. I would assume that if these customers had nonzero values, those values would span a range rather than sit at a single point. That's one reason why multiple imputation has become so popular: it preserves some uncertainty (variability) in the imputed values while using all available information to assign them as plausibly as possible. So rather than proceeding as you suggest, I would use multiple imputation as one approach.
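
To make the multiple-imputation idea concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the data are synthetic, the names `x1_obs`, `x2`, and `y` are hypothetical, the imputation model is a plain linear regression, and the values are taken to be missing at random, which may not hold for customers who never used their account. Each of the `m` imputed data sets gets its own noisy draws, and the results are pooled with Rubin's rules.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 400, 20  # sample size and number of imputed data sets (assumptions)

# Synthetic data: x2 is always observed; x1 ("time spent") is treated as
# missing for roughly 30% of rows, standing in for the recorded zeros.
x2 = rng.normal(size=n)
x1_full = 2.0 + 0.8 * x2 + rng.normal(size=n)
y = 1.0 + 0.5 * x1_full + 0.3 * x2 + rng.normal(size=n)
missing = rng.random(n) < 0.3
x1_obs = np.where(missing, np.nan, x1_full)
obs_rows = np.flatnonzero(~missing)


def ols(X, target):
    """Return OLS coefficients and their estimated covariance matrix."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return coef, sigma2 * np.linalg.inv(X.T @ X)


# Imputation model: x1 regressed on x2 and the outcome y, using observed rows.
Z = np.column_stack([np.ones(n), x2, y])
slopes, variances = [], []
for _ in range(m):
    # Bootstrap the observed rows so each imputation reflects parameter uncertainty.
    idx = rng.choice(obs_rows, size=obs_rows.size, replace=True)
    gamma, *_ = np.linalg.lstsq(Z[idx], x1_obs[idx], rcond=None)
    resid_sd = np.std(x1_obs[idx] - Z[idx] @ gamma, ddof=Z.shape[1])

    # Imputed value = prediction plus random noise, not a single constant.
    x1 = x1_obs.copy()
    x1[missing] = Z[missing] @ gamma + rng.normal(0.0, resid_sd, missing.sum())

    # Analysis model on the completed data: y ~ 1 + x1 + x2.
    coef, cov = ols(np.column_stack([np.ones(n), x1, x2]), y)
    slopes.append(coef[1])
    variances.append(cov[1, 1])

# Rubin's rules: average the estimates, combine within and between variance.
pooled_slope = np.mean(slopes)
within = np.mean(variances)
between = np.var(slopes, ddof=1)
total_var = within + (1 + 1 / m) * between
print(f"pooled slope = {pooled_slope:.3f}, SE = {np.sqrt(total_var):.3f}")
```

The pooled standard error is larger than the average within-imputation one because of the between-imputation term, which is exactly the uncertainty that filling in a single median would hide.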

A second approach would be censored regression, as described in the growing number of threads that you can find via a Google search or a search on this site. I found this page helpful. (Though it may be less helpful if all your zero-inflated variables are predictors.)

In either case I like your idea of including a binary column to indicate zero/nonzero.
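
As a rough illustration of what the indicator column does, here is a sketch with synthetic data (the variable names and the data-generating numbers are assumptions, not taken from the question). In a plain linear model that includes the zero/nonzero indicator, the constant used to fill the zeros only shifts the indicator's coefficient; the intercept and the slope on time spent are unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (assumption): about 30% of customers never used the
# account, so their recorded time is exactly zero.
never_used = rng.random(n) < 0.3
time_spent = np.where(never_used, 0.0, rng.gamma(shape=2.0, scale=5.0, size=n))

# Hypothetical outcome: non-users have their own baseline, users follow a line.
y = np.where(never_used, 1.0, 3.0 + 0.5 * time_spent) + rng.normal(0.0, 1.0, n)


def fit_with_indicator(fill_value):
    """OLS of y on [1, is_zero, time_filled], zeros replaced by fill_value."""
    is_zero = (time_spent == 0).astype(float)
    time_filled = np.where(time_spent == 0, fill_value, time_spent)
    X = np.column_stack([np.ones(n), is_zero, time_filled])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, indicator coefficient, slope on time]


median_nonzero = np.median(time_spent[time_spent > 0])
coef_median = fit_with_indicator(median_nonzero)
coef_zero = fit_with_indicator(0.0)

print("fill = median:", coef_median)
print("fill = 0     :", coef_zero)
```

This equivalence relies on the model being linear in the filled variable with no interactions or transformations; with those, the choice of fill value can matter again.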

EDIT: for some good introductions to multiple imputation, see the articles by Melissa Azur et al. (2011); John W. Graham (2009); and Jeffrey C. Wayman (2003).

rolando2