
I have a predictor variable with many zeros. It is simply a count of the occurrences of some behavior, and the zeros are qualitatively meaningful. I'd like to log-transform the variable but am unsure what to do with the zeros, again because they are qualitatively meaningful. I'd rather not simply add 1 to the variable before taking the log.

I've seen others allude to an example from Hosmer and Lemeshow in which a dummy variable indicates whether a person smokes and a continuous variable records the number of cigarettes smoked. I'm not sure exactly how this coding would work. Should the dummy be coded 1 to indicate zero values of the continuous variable, or should 1 indicate positive values? I'd also like to log-transform the continuous variable because I've got serious overdispersion. I'm guessing I'd log-transform the positive values and keep the zeros as zeros, but then some of those zeros will be 'true' zeros and others will be zeros only because they were 1 on the original scale (since log(1) = 0).
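To make the coding question concrete, here is a minimal sketch of one possible encoding. It assumes the dummy is coded 1 for positive counts (the opposite coding would only flip the sign of the dummy's coefficient and shift the intercept). The function name `encode_count` and the example counts are hypothetical, chosen only for illustration.

```python
import math


def encode_count(count):
    """Return (is_positive, log_count) for a non-negative count.

    is_positive is 1 when count > 0 and 0 otherwise (assumed coding).
    log_count is log(count) for positive counts and 0.0 for zeros.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    is_positive = 1 if count > 0 else 0
    log_count = math.log(count) if count > 0 else 0.0
    return is_positive, log_count


# Hypothetical counts, including a true zero and a count of 1.
counts = [0, 1, 2, 10]
encoded = [encode_count(c) for c in counts]

for c, (d, lc) in zip(counts, encoded):
    print(f"count={c:>2}  is_positive={d}  log_count={lc:.3f}")
```

With both columns in the model, a count of 0 and a count of 1 share `log_count = 0.0` but differ in `is_positive`, so the two kinds of zero on the log scale remain distinguishable.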

I looked through Applied Logistic Regression and could not find the example that others have alluded to.

Thanks for your input.

whauser
  • Also see http://stats.stackexchange.com/questions/6563. With counts there are several other natural transformation options, most of them related to square roots, which avoid the log-of-zero problem. You should be motivated by concerns of goodness of fit in the model more than anything else; having lots of zeros is not in itself necessarily a problem. – whuber Dec 01 '14 at 20:30
  • Many thanks for the helpful advice. I'd like to keep the coefficients interpretable and the transformation (if any) theoretically meaningful so that it is defensible. Square root transformations are not widely used in my field, and I rather suspect that reviewers will view them with skepticism (which is unfortunate). I also considered the inverse hyperbolic sine, but I haven't seen it applied to a predictor variable and am not sure how it would be interpreted. I suspect it would be just as with the log (exponentiate it and consider it percent changes on the geometric mean), but I am not sure. – whauser Dec 03 '14 at 18:16
  • If you conduct goodness of fit tests that show a significantly better fit using a root compared to a logarithm or no transformation, then that should trump any amount of convention or prejudice. Aren't the data supposed to have the loudest voice? – whuber Dec 03 '14 at 18:19
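For reference, the square root and inverse hyperbolic sine mentioned in the comments are both defined at zero, unlike the log. The snippet below is only an illustrative sketch on made-up counts; the helper name `compare_transforms` is hypothetical, and it says nothing about which transformation fits a given data set better.

```python
import math


def compare_transforms(counts):
    """Return (count, sqrt, asinh) rows for non-negative counts."""
    return [(c, math.sqrt(c), math.asinh(c)) for c in counts]


# Hypothetical counts; zero is handled without any added constant.
rows = compare_transforms([0, 1, 5, 50])

for c, root, ihs in rows:
    print(f"count={c:>2}  sqrt={root:.3f}  asinh={ihs:.3f}")

# For large counts, asinh(x) = log(x + sqrt(x^2 + 1)) is close to log(2x),
# so it behaves like a log shifted by the constant log(2).
```

Both transformations map 0 to 0, so true zeros stay at zero without adding a constant first.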

0 Answers