
I have the following distribution:

[Histogram: right-skewed distribution with a large number of zero values]

I have numerous zero values, which I change to 1 so that I have no zero or negative numbers before the log transform. Of course, log(1) = 0, so I get the following results:

[Histogram: roughly normal after the log transform, with a spike at the low end]

The log transform has given me something closer to a normal distribution, except at the lower end of the data. Does the spike mean that performing the transformation is meaningless?

Windstorm1981
  • What problem are you trying to solve? – Sycorax Jan 16 '20 at 17:01
  • This is input data. Many models work best when the input is normally distributed. When faced with an exponential distribution I apply a log transformation. But the lower end value give me the spike in the otherwise more normally distributed data. Therefore, does it make sense to NOT transform this data? – Windstorm1981 Jan 16 '20 at 17:03
  • What model are you using that requires normally-distributed inputs? – Sycorax Jan 16 '20 at 17:05
  • Most parametric models do better when inputs are Gaussian: linear regression, naive Bayes. – Windstorm1981 Jan 16 '20 at 17:06
  • @SycoraxsaysReinstateMonica should clarify whether "scaling" means dividing by the SD so that variance = 1; linear regression is *not* invariant to power transforms (of which log is a special case). Also, a few network-based analyses like SEM require normal inputs because prespecified network structures place constraints on the working covariance structure due to independence assumptions. – AdamO Jan 16 '20 at 17:16
  • @AdamO My interest in this line of questioning is identifying whether OP's question is essentially an XY problem. Right now, I think it might be. I agree that a power transformation is a different OLS model than a model of non-transformed data, but it's not obvious that is the problem that OP wishes to solve. If OP's motivation is "OLS only works for normally-distributed inputs," that seems worth addressing. – Sycorax Jan 16 '20 at 17:23
  • Input data is not "required" to be transformed. However, many models perform better when input data is Gaussian. It is common practice to transform non-Gaussian input data. Sometimes it improves things, sometimes not. – Windstorm1981 Jan 16 '20 at 17:25
  • We find many people asking questions predicated on this "common practice"--but it's usually based on the wrong considerations. "Parametric" is not the same as "Gaussian;" response variables usually *ought* to have complex distributions that reflect their explanatory variables; most parametric assumptions pertain to *error* terms, not to the variables themselves; whether and how to transform a variable depends on how it will be used in an analysis; and much more. Thus, you might or might not have a problem here, and your approach might make the situation better, worse, or not matter. – whuber Jan 16 '20 at 17:52
  • Some relevant threads include https://stats.stackexchange.com/questions/30728 (on taking logs of data with zero values), https://stats.stackexchange.com/questions/35711 (on transforming a response variable), and https://stats.stackexchange.com/questions/4831 (on transforming regression variables). – whuber Jan 16 '20 at 17:55
  • I didn't say that "parametric" is the same as "Gaussian". I said that parametric models sometimes perform better with Gaussian-distributed inputs, hence the reason for the transform. Applying transformations where applicable is a fundamental part of data preparation prior to model evaluation and selection. – Windstorm1981 Jan 16 '20 at 18:29
  • @whuber your suggestion https://stats.stackexchange.com/questions/4831/regression-transforming-variables is very helpful – Windstorm1981 Jan 16 '20 at 18:31

2 Answers

2

If the response is count-valued, you should consider an appropriate modeling strategy that implicitly log-transforms the intensity rather than the count values themselves. A Poisson process with low intensity is likely to produce right-skewed results and many 0s, but log-transforming the data would lead to highly biased estimates of the mean and of the SD, regardless of how the 0 values are handled. Fitting a negative binomial or quasi-Poisson GLM as a probabilistic model via MLE is a simple, straightforward, and efficient way to summarize count data where applicable.
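As an illustrative sketch (my addition, not part of the original answer; the Poisson data and seed are assumptions), the bias can be seen by comparing the back-transformed mean of the logged, zero-replaced counts against the Poisson MLE of the intensity, which is what an intercept-only Poisson GLM estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated low-intensity Poisson counts: right-skewed, many exact zeros
y = rng.poisson(lam=0.8, size=100_000)

# Naive approach from the question: replace zeros with 1, then log
y_naive = np.where(y == 0, 1, y)
mean_from_logs = np.exp(np.log(y_naive).mean())  # back-transformed mean, badly biased

# The MLE of the Poisson intensity is simply the sample mean of the raw counts,
# i.e. what a Poisson GLM with only an intercept recovers on the log scale
mle_intensity = y.mean()

print(mean_from_logs, mle_intensity)
```

With the true intensity at 0.8, the naive back-transformed estimate lands well above it, while the plain sample mean does not.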

A log-transform alternative that has been explored for variables with 0 values is log1p, where 1 is added to every value before applying the log transform, not just to the 0 values. This will decrease the intensity of the truncation, but not satisfactorily. In other words, your approach is not justified regardless of how the distribution looks post hoc; the resulting histogram is evidence of this.
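A small sketch (my addition, with an assumed simulated dataset) showing why log1p cannot remove the spike: every exact zero maps to log1p(0) = 0, so the point mass at zero survives the transform intact:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.poisson(lam=0.8, size=10_000)  # many exact zeros

z = np.log1p(y)  # add 1 to every value, then log: the "log1p" approach

# Every zero count maps to exactly 0, so the spike is preserved
zero_share = float(np.mean(z == 0.0))
print(zero_share)
```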

If the response is continuous-valued, the distribution has mixture properties, and mixture modeling is one approach to estimating the density with better precision than the proposed approach.

Nick Cox
AdamO
0

One possibility is to separate your data into two variables: an indicator of whether the value is zero or non-zero, and the value of the variable given that it is non-zero. Then apply the transformation to the non-zero part. On a separate note, you could try a power transform such as the Box-Cox transform.
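The two-variable split described above can be sketched as follows (my addition; the zero-inflated simulated data is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Zero-inflated positive data: ~40% structural zeros, lognormal otherwise
n = 5_000
structural_zero = rng.random(n) < 0.4
y = np.where(structural_zero, 0.0,
             rng.lognormal(mean=1.0, sigma=0.8, size=n))

# Part 1: indicator of zero vs non-zero
indicator = (y > 0).astype(int)

# Part 2: the value given that it is non-zero, log-transformed
# (no zeros remain here, so no ad hoc replacement is needed)
log_positive = np.log(y[y > 0])

print(indicator.mean(), log_positive.mean())
```

The positive part can then be modeled on the log scale, and the indicator with, say, a logistic model, in the spirit of a two-part (hurdle) model.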

  • Box-Cox with unspecified lambda on this series gave a value of 0.04, which is virtually the log transformation. – Windstorm1981 Jan 16 '20 at 17:27
  • True. But it should be remembered that Box-Cox transformations have the limitation that the data must be positive, which applies to the log transformation as well. So a better approach is to remove the non-positive data points and keep them in a separate variable, which should leave your data approximately exponential. – Charles Smith Jan 16 '20 at 17:38
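As a hedged illustration of that positivity requirement (my addition; the simulated data is an assumption, and SciPy is assumed available), scipy's `boxcox` estimates lambda by maximum likelihood and refuses non-positive input, which is why the zeros have to be set aside first:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=5_000)  # strictly positive, right-skewed

# Maximum-likelihood estimate of the Box-Cox lambda on positive data
xt, lam = stats.boxcox(x)
print(lam)

# With even one zero present, the transform is not defined and boxcox raises
try:
    stats.boxcox(np.append(x, 0.0))
    raised = False
except ValueError:
    raised = True
print(raised)
```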