
I have an outcome variable that is right-skewed, so I log-transformed it. I fit a null model with only the log-transformed outcome variable, but when I exponentiate the intercept estimate, it does not equal the mean.

Concerned this was an issue with my data, I made a sample data set and found the same discrepancy. Why is this? What does the intercept represent in this model?

Here is the sample data and R code:

library(tidyverse)
test <- tibble(salary = c(10000, 23244, 2222222, 2353, 2353463, 5464564),
               perf = c(4, 2, 4, 2, 5, 7))

Here's my null model:

summary(lm(log(salary) ~ 1, data = test))

The intercept equals 11.971, and when I exponentiate it with exp(11.971), I get 158102.7:

exp(11.971)

But the mean is 1679308:

mean(test$salary)

And, as a sanity check, when I don't log transform the outcome, the intercept does produce the mean:

summary(lm(salary ~ 1, data = test))

I'd appreciate knowing 1) how to interpret the intercept, 2) why it doesn't equal the mean, and 3) how I could get predictions on the original (non-log) scale from this model.

J.Sabree
  • Another easy way to see why this does not work is that while `exp(log(mean(x)))` is equal to `mean(x)`, `exp(mean(log(x)))` is not. – Frans Rodenburg Mar 03 '21 at 07:51
  • This issue is also handled in Hyndman's `forecast` package for a wider class of transformations (the Box-Cox family) of the dependent variable. See here: https://otexts.com/fpp2/transformations.html#mathematical-transformations – Dayne Mar 04 '21 at 04:42
  • Other relevant links from CV: (1) https://stats.stackexchange.com/questions/359088/correcting-log-transformation-bias-in-a-linear-model; (2) https://stats.stackexchange.com/questions/69613/bias-correction-of-logarithmic-transformations?rq=1 – Dayne Mar 04 '21 at 04:45

1 Answer

This is a consequence of Jensen's Inequality. You want $E[y_i \mid x_i]$, but exponentiating the predicted value(s) from the log model will not provide unbiased estimates of it, since $$E[y_i \mid x_i] = \exp(x_i'\beta) \cdot E[\exp(u_i)],$$ and the second term is omitted in your calculation. Because $\exp(\cdot)$ is convex, Jensen's inequality gives $E[\exp(u_i)] \ge \exp(E[u_i]) = 1$, so dropping that term biases the back-transformed prediction downward.
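You can see the gap concretely with the `test` data from your question: the naive back-transform recovers the geometric mean of `salary`, which by the AM-GM inequality is never larger than the arithmetic mean.

# exp of the mean log is the geometric mean, not the arithmetic mean
exp(mean(log(test$salary)))  # 158102.7, identical to exp(intercept)
mean(test$salary)            # 1679308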

If the error term $u \sim N[0,\sigma^2]$, then $E[\exp(u)] = \exp(\frac{1}{2}\sigma^2)$. That quantity may be estimated by replacing $\sigma^2$ with its consistent estimate $s^2$ from the regression model.
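In R, that correction is straightforward to compute (a sketch reusing the `test` tibble from your question; it is only as accurate as the Normality assumption on the log-scale errors):

# Normal-theory correction: exp(intercept + s^2/2), with s the residual standard error
m <- lm(log(salary) ~ 1, data = test)
exp(coef(m)[1] + sigma(m)^2 / 2)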

Alternatively, Duan (1983) shows that for $iid$ errors (which need not be Normal), $$E[\exp(u)] = \frac{1}{N} \sum_i \exp(e_i),$$ where $e_i$ are the residuals.

I've implemented Duan's smearing transformation below. Essentially, you multiply the exponentiated fitted value by the average of the exponentiated residuals:

library(tidyverse)
test <- tibble(salary = c(10000, 23244, 2222222, 2353, 2353463, 5464564),
               perf = c(4, 2, 4, 2, 5, 7))
# Intercept-only model on the log scale
m <- lm(log(salary) ~ 1, data = test)
# Smearing: exponentiated fitted value times the mean exponentiated residual
mean(exp(m$fitted.values)) * mean(exp(m$residuals))
mean(test$salary)  # matches

This will work even if you have covariates in the model, though you will have to tweak the calculation a bit since the predictions will now vary across observations:

mean(exp(m$fitted.values) * exp(m$residuals))

This second version should also work in your intercept-only example.
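For concreteness, here is a sketch of per-observation smearing predictions using the `perf` covariate (the model `m2` is illustrative; your question only fits the intercept-only model):

# Illustrative covariate model; fitted values now vary across observations
m2 <- lm(log(salary) ~ perf, data = test)
smear <- mean(exp(m2$residuals))   # Duan's smearing factor
exp(m2$fitted.values) * smear      # predicted salaries on the dollar scale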

dimitriy
  • I know you said this would work if I had covariates in the model, but to be clear: if I wanted to convert the impact of the IV perf, then for the Duan transformation I would do exp(m$coefficients) * mean(exp(m$residuals)), correct? – J.Sabree Mar 03 '21 at 03:41
  • I did the simple intercept-only case to simplify things and match your calculation. The second one should work. – dimitriy Mar 03 '21 at 03:48