I have read several similar questions, but mine pertains specifically to zero-rich data.
I will be back-transforming my data based on a first-order Taylor series approximation, as outlined in
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.9023&rep=rep1&type=pdf
(from page 7):
$$\tilde{X} = \exp(\hat{Y}) \\ \tilde{\sigma}_X = \tilde{X} \hat{\sigma}_Y$$
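For reference, my reading of where these two formulas come from is the usual first-order (delta-method) expansion. The function $g$ and the mean $\mu_Y$ are my own notation for illustration, not taken from the linked paper:

$$ X = g(Y) \approx g(\mu_Y) + g'(\mu_Y)(Y - \mu_Y) \\ \sigma_X \approx |g'(\mu_Y)| \sigma_Y $$

With $g(Y) = \exp(Y)$ we have $g'(Y) = \exp(Y)$, so plugging in $\hat{Y}$ for $\mu_Y$ gives $\tilde{\sigma}_X = \exp(\hat{Y}) \hat{\sigma}_Y = \tilde{X} \hat{\sigma}_Y$.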
According to Wooldridge (2009, p. 192):
"The percentage change interpretations are often closely preserved, except for changes beginning at y=0 (where the percentage change is not even defined). Generally, using log(1+y) and then interpreting the estimates as if the variable were log(y) is acceptable when the data contain relatively few zeros."
However, my data are zero-rich, so I came up with the following:
$$ \tilde{X} = \exp(\hat{Y})-1 \\ \tilde{\sigma}_X = \tilde{X} \hat{\sigma}_Y $$
My reasoning is that the formula for the sample SD is
$$\sigma_X = \sqrt{\frac{\sum{(X-\bar{X})^2}}{n-1}} $$
and depends only on the differences between X and $\bar{X}$, so shifting X by any constant c should not affect the standard deviation.
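The shift-invariance claim itself is easy to check numerically. Below is a minimal sketch in R; the vector `x` and the names `shift`, `sd_original` and `sd_shifted` are made up for illustration:

```r
# Hypothetical check: the sample SD is unchanged by adding a constant.
x <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2)  # a small zero-rich example vector
shift <- 1                                   # the constant c (here the +1 offset)

sd_original <- sd(x)          # SD of the raw values
sd_shifted  <- sd(x + shift)  # SD after shifting every value by c

# all.equal() compares with a numerical tolerance rather than exact equality
isTRUE(all.equal(sd_original, sd_shifted))
```

Note that this only confirms invariance on the original scale. One thing I am unsure about: if the SD is unchanged by the shift, the first-order SD would still be $\exp(\hat{Y}) \hat{\sigma}_Y = (\tilde{X} + 1) \hat{\sigma}_Y$ when $\tilde{X} = \exp(\hat{Y}) - 1$, which is not the same as $\tilde{X} \hat{\sigma}_Y$.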
A simple test to show the problem with zero-rich data:
# log is the natural log; code is in R
test <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2)
mean(test) # CASE1 mean on untransformed data
#[1] 0.3333333
exp(mean(log(test + 1))) - 1 #CASE2 mean estimation adjusted for +1
#[1] 0.2300755
exp(mean(log(test + 1))) #CASE3 mean estimation based on exp(y)
#[1] 1.230076
gm_mean = function(a){prod(a)^(1/length(a))} # function from @doug
gm_mean(test) #CASE4 geometric mean
#[1] 0
CASE3 != CASE4, showing that the +1 transformation is affecting the estimate of the mean, since we would expect exp(y) to be equal to the geometric mean of y.
CASE4 (0) and CASE2 (0.23) are both closer to CASE1 (0.33) than CASE3 (1.23) is, with CASE2 being the best estimate.
This shows that for zero-rich distributions there appears to be a discrepancy among the different estimates of the mean.
If we perform the same analysis on the following data, which contain no zeros and so do not need the +1 transformation:
test2 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3)
mean(test2) # CASE1 mean on untransformed
#[1] 1.333333
exp(mean(log(test2 + 1))) - 1 # CASE2 log-transformed mean, approx. exp(y) - 1
#[1] 1.267067
exp(mean(log(test2))) # CASE3 mean exp(y)
#[1] 1.230076
gm_mean(test2) # CASE4 geometric mean equals exp(y) as expected
#[1] 1.230076
We find that CASE3 == CASE4, and both are roughly equal to CASE2. CASES 2-4 are all good approximations of the CASE1 mean.
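The comparisons above can be collected into a single helper so that both vectors go through identical code. This is only a sketch: `compare_means` and its element names are hypothetical, and it uses base R's `log1p()`/`expm1()` in place of `log(x + 1)` and `exp(y) - 1`:

```r
# Hypothetical helper (not part of the original tests): returns the
# arithmetic mean, the log(x + 1) back-transformed mean, and the geometric mean.
compare_means <- function(x) {
  c(
    arithmetic = mean(x),                # CASE1: mean on untransformed data
    log1p_back = expm1(mean(log1p(x))),  # CASE2: exp(mean(log(x + 1))) - 1
    geometric  = exp(mean(log(x)))       # CASE4: geometric mean; 0 if any x is 0
  )
}

test  <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2)  # zero-rich data
test2 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3)  # same data shifted by +1

res_zero_rich <- compare_means(test)
res_no_zeros  <- compare_means(test2)

print(round(res_zero_rich, 4))
print(round(res_no_zeros, 4))
```

Computing the geometric mean on the log scale avoids overflow in `prod()` for long vectors, and it still returns 0 when any value is 0 because `log(0)` is `-Inf` in R.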
TL;DR: are $$ \tilde{X} = \exp(\hat{Y})-1 \\ \tilde{\sigma}_X = \tilde{X} \hat{\sigma}_Y $$ a good approximation for the mean and standard deviation of zero-rich ln(x + 1)-transformed data?
edit: fixed sd formula