
I have read several similar questions, but mine pertains specifically to zero-rich data.

I will be back-transforming my data using a first-order Taylor series approximation, as outlined on page 7 of http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.9023&rep=rep1&type=pdf:

$$\tilde{X} = \exp(\hat{Y}) \\ \tilde{\sigma}_X = \tilde{X} \hat{\sigma}_Y$$

According to Wooldridge 2009 (p. 192):

"The percentage change interpretations are often closely preserved, except for changes beginning at y=0 (where the percentage change is not even defined). Generally, using log(1+y) and then interpreting the estimates as if the variable were log(y) is acceptable when the data contain relatively few zeros."

However, my data are zero-rich, so I came up with the following:

$$ \tilde{X} = \exp(\hat{Y})-1 \\ \tilde{\sigma}_X = \tilde{X} \hat{\sigma}_Y $$

My understanding is that, because the formula for the sample SD is

$$\sigma_x = \sqrt{\frac{\sum{(X-\bar{X})^2}}{n-1}} $$

and depends only on the differences between $X$ and $\bar{X}$, shifting the data by any constant $c$ should not affect the standard deviation.
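One way to check the proposed standard deviation formula is to repeat the first-order Taylor argument for the shifted transform. The sketch below assumes $Y = \ln(X + 1)$ and writes $g(y) = \exp(y) - 1$ for the back-transformation (the name $g$ is introduced here only for illustration), keeping just the linear term of the expansion around $\hat{Y}$:

$$ g(Y) \approx g(\hat{Y}) + g'(\hat{Y})\,(Y - \hat{Y}), \qquad g'(y) = \exp(y) \\ \tilde{X} = g(\hat{Y}) = \exp(\hat{Y}) - 1 \\ \tilde{\sigma}_X \approx \left| g'(\hat{Y}) \right| \hat{\sigma}_Y = \exp(\hat{Y})\, \hat{\sigma}_Y = (\tilde{X} + 1)\, \hat{\sigma}_Y $$

Under this approximation, subtracting 1 shifts the back-transformed mean but leaves the multiplier on $\hat{\sigma}_Y$ at $\exp(\hat{Y})$, which is what one would expect if a constant shift does not change a standard deviation.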

A simple test to show the problem with zero-rich data:

# log is the natural log; code is in R
test <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2)
mean(test) # CASE1 mean on untransformed data
#[1] 0.3333333
exp(mean(log(test + 1))) - 1 #CASE2 mean estimation adjusted for +1
#[1] 0.2300755
exp(mean(log(test + 1))) #CASE3 mean estimation based on exp(y)
#[1] 1.230076

gm_mean = function(a){prod(a)^(1/length(a))} # function from @doug
gm_mean(test) #CASE4 geometric mean
#[1] 0

CASE3 != CASE4, showing that the transformation affects the estimate of the mean, since we would otherwise expect the back-transformed mean of the logs to equal the geometric mean of the data.
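The identity behind this expectation, and what the +1 shift does to it, can be written out. This is a sketch, assuming all $x_i > 0$ in the first line so that the logs are defined:

$$ \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln x_i\right) = \left(\prod_{i=1}^{n} x_i\right)^{1/n} \\ \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln (x_i + 1)\right) = \left(\prod_{i=1}^{n} (x_i + 1)\right)^{1/n} $$

So CASE3 is the geometric mean of test + 1 rather than of test, while CASE4 is 0 because a single zero makes the whole product zero.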

CASE4 and CASE2 are both close to CASE1, with CASE2 being the best estimate.

This shows that for zero-rich distributions there appears to be a discrepancy among the different estimates of the mean.

If we perform the same analysis on the following data, which does not need transformation:

test2 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3)
mean(test2) # CASE1 mean on untransformed
#[1] 1.333333
exp(mean(log(test2 + 1))) - 1 # CASE2 log trans mean approx exp(y) -1
#[1] 1.267067
exp(mean(log(test2))) #CASE3 mean exp(y)
#[1] 1.230076
gm_mean(test2) #CASE4 geometric mean equals exp(y) as expected
#[1] 1.230076

We find that CASE3 == CASE4 and that both are roughly equal to CASE2. CASE2 through CASE4 are all good approximations for the transformed mean.
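To compare candidate standard deviation formulas on both example vectors, a minimal R sketch follows. The helper name `back_transform_log1p` and its output labels are hypothetical (not from the linked paper or any package); `sd_shifted` uses the slope of exp(y) - 1 from a first-order Taylor expansion, `sd_question` is the formula as written in the question, and `sd_raw` is the ordinary sample SD on the original scale for reference.

```r
# Hypothetical helper (illustration only): back-transform the mean and SD of
# log(x + 1) data and report the candidate SD formulas side by side.
back_transform_log1p <- function(x) {
  y <- log1p(x)               # log(x + 1), natural log
  y_mean <- mean(y)
  y_sd <- sd(y)
  x_tilde <- expm1(y_mean)    # exp(y_mean) - 1, i.e. CASE2
  c(
    mean_bt     = x_tilde,
    sd_shifted  = (x_tilde + 1) * y_sd,  # first-order slope is exp(y_mean)
    sd_question = x_tilde * y_sd,        # formula as proposed in the question
    sd_raw      = sd(x)                  # sample SD of the untransformed data
  )
}

test  <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2)
test2 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3)

res1 <- back_transform_log1p(test)
res2 <- back_transform_log1p(test2)
print(round(rbind(test = res1, test2 = res2), 4))
```

The two approximate SDs always differ by exactly the SD on the log scale, so the gap matters most when the back-transformed mean is close to zero, as it is for zero-rich data.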

TL;DR: are $$ \tilde{X} = \exp(\hat{Y})-1 \\ \tilde{\sigma}_X = \tilde{X} \hat{\sigma}_Y $$ a good approximation for the mean and standard deviation of zero-rich ln(x + 1)-transformed data?

edit: fixed sd formula

NicoFish
  • Why do you want to transform your data? Why not fit a zero-inflated model? – Robert Long Apr 24 '19 at 03:58
  • Unfortunately, you haven't told us what you are actually trying to do, so no one would be able to provide you with ANY guidance that would be useful at this stage. – StatsStudent Apr 24 '19 at 04:04
  • My data is not zero-rich enough to justify a zero-inflated model (though I will look into this for another project I am doing). Also, I transformed the data after looking at the distribution and using the Box-Cox procedure to justify my choice (lambda was essentially zero). The question is more of a hypothetical. I merely want to know if adapting the back-transformation method from the PDF to adjust for log + 1 is correct. If it is not, would it be correct for me to use the method from the included link for data that has been transformed with log + 1? – NicoFish Apr 24 '19 at 04:12
  • Again, you haven't told us what it is you are trying to actually do. What is it you are trying to accomplish by taking these manipulations of your data? What is the context? Analysts typically don't just transform data for the fun of it. If we don't know your motivations it would be impossible for us to tell if any "back transformation" method is "correct." – StatsStudent Apr 24 '19 at 04:22
  • @StatsStudent. This is a question about back-transformation of data that has been transformed using log + 1. The analysis is irrelevant. You simply want to get the mean and sd of the transformed data NOT perform an analysis. You would like to report the mean and SE of the transformed data for the benefit of the reader. You cannot simply display mean(log(data + 1)). I would like to know if it is correct to represent the mean as exp(mean(log(data +1))) - 1 instead of exp(mean(log(data +1))). In this case back-transformation has nothing to do with any analysis or end goal. – NicoFish Apr 24 '19 at 05:03
  • Why not use a generalised linear model? – Nick Cox Apr 24 '19 at 06:47
  • Your formula for SD is lacking a summation sign and a square symbol in key places. – Nick Cox Apr 24 '19 at 10:06
  • @NickCox How would I use a generalized linear model to get the mean and SE of my transformed data? If you have any resources I would be grateful – NicoFish Apr 24 '19 at 19:33
  • The question doesn’t really arise, but predictions for the mean response as a function of the predictors are returned on the original scale of the response. Your question seems confused to me as taking the mean only commutes with linear transformations, and all else is approximation. Several texts have generalized linear models as part of their title: I can’t guess which might be most congenial to you, especially as you appear to be an economics student (it’s usually economists who assume that Wooldridge 2009 suffices as a reference). – Nick Cox Apr 24 '19 at 19:47
  • Before you pursue this analysis any further, please see https://stats.stackexchange.com/questions/30728. It shows that Wooldridge is not generally correct and indicates the possibility that your approach might not work. Also take a look at https://stats.stackexchange.com/questions/41361. – whuber Apr 24 '19 at 19:58
  • @NickCox I am a statistics and entomology student (recently graduated). I am working on a very large insect trap dataset. The data lends itself very well to log transformation, and my PI urges me to use this as it is standard for the field. After transformation the variance is stabilized and the Q-Q plot looks very good. I performed a Welch's t-test, t.test(TrapCount ~ Near.Or.Far), on the transformed data. Now I need to get the mean and SE of Near traps and Far traps. From what I understood, I could do this by approximating using Taylor series moments (as in the link). However, I did not know how to cope with the log + 1 transformation. – NicoFish Apr 24 '19 at 22:26
