
I have an outcome variable that is right-skewed, so I log-transformed it. I fit a null model with only the log-transformed outcome variable, but when I exponentiate the intercept estimate, it does not equal the mean.

Concerned this was an issue with my data, I made a sample data set and found the same discrepancy. Why is this? What does the intercept represent in this model?

Here is the sample data and R code:

library(tidyverse)
test <- tibble(salary = c(10000, 23244, 2222222, 2353, 2353463, 5464564),
               perf = c(4, 2, 4, 2, 5, 7))

Here's my null model:

summary(lm(log(salary) ~ 1, data = test))

The intercept equals 11.971, and when I exponentiate it with exp(11.971), I get 158102.7:

exp(11.971)

But the mean is 1679308:

mean(test$salary)

And, as a sanity check, when I don't log transform the outcome, the intercept does produce the mean:

summary(lm(salary ~ 1, data = test))

I'd appreciate knowing 1) how to interpret the intercept, 2) why it doesn't equal the mean, and 3) how I could get predictions on the original (non-log) scale from this model.

J.Sabree
  • Another easy way to see why this does not work is that while `exp(log(mean(x)))` is equal to `mean(x)`, `exp(mean(log(x)))` is not. – Frans Rodenburg Mar 03 '21 at 07:51
  • This issue is also handled in Hyndman's `forecast` package for a wider class of transformations (the Box-Cox family) of the dependent variable. See here: https://otexts.com/fpp2/transformations.html#mathematical-transformations – Dayne Mar 04 '21 at 04:42
  • Other relevant links from CV: (1) https://stats.stackexchange.com/questions/359088/correcting-log-transformation-bias-in-a-linear-model; (2) https://stats.stackexchange.com/questions/69613/bias-correction-of-logarithmic-transformations?rq=1 – Dayne Mar 04 '21 at 04:45

1 Answer

This is a consequence of Jensen's Inequality. You want $E[y_i \mid x_i]$, but exponentiating the predicted value(s) from the log model will not provide unbiased estimates of it, since $$E[y_i \mid x_i] = \exp(x_i'\beta) \cdot E[\exp(u_i)],$$ and the second term is omitted in your calculation. Because $\exp(\cdot)$ is convex, Jensen's inequality gives $E[\exp(u_i)] \ge \exp(E[u_i]) = 1$, so dropping that term biases the back-transformed prediction downward.
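You can see the gap concretely with the `test` data from your question: the naive back-transform recovers the geometric mean of `salary`, which by the AM-GM inequality is never larger than the arithmetic mean.

# exp of the mean log is the geometric mean, not the arithmetic mean
exp(mean(log(test$salary)))  # 158102.7, identical to exp(intercept)
mean(test$salary)            # 1679308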

If the error term $u \sim N[0,\sigma^2]$, then $E[\exp(u)] = \exp(\frac{1}{2}\sigma^2)$. That quantity may be estimated by replacing $\sigma^2$ with its consistent estimate $s^2$ from the regression model.
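In R, that correction is straightforward to compute (a sketch reusing the `test` tibble from your question; it is only as accurate as the Normality assumption on the log-scale errors):

# Normal-theory correction: exp(intercept + s^2/2), with s the residual standard error
m <- lm(log(salary) ~ 1, data = test)
exp(coef(m)[1] + sigma(m)^2 / 2)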

Alternatively, Duan (1983) shows that for $iid$ errors (which need not be Normal), $$E[\exp(u)] = \frac{1}{N} \sum_i \exp(e_i),$$ where $e_i$ are the residuals.

I've implemented Duan's smearing transformation below. Essentially, you multiply the exponentiated fitted value by the average of the exponentiated residuals:

library(tidyverse)
test <- tibble(salary = c(10000, 23244, 2222222, 2353, 2353463, 5464564),
               perf = c(4, 2, 4, 2, 5, 7))
# Intercept-only model on the log scale
m <- lm(log(salary) ~ 1, data = test)
# Smearing: exponentiated fitted value times the mean exponentiated residual
mean(exp(m$fitted.values)) * mean(exp(m$residuals))
mean(test$salary)  # matches

This will work even if you have covariates in the model, though you will have to tweak the calculation a bit since the predictions will now vary across observations:

mean(exp(m$fitted.values) * exp(m$residuals))

This second version should also work in your intercept-only example.
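For concreteness, here is a sketch of per-observation smearing predictions using the `perf` covariate (the model `m2` is illustrative; your question only fits the intercept-only model):

# Illustrative covariate model; fitted values now vary across observations
m2 <- lm(log(salary) ~ perf, data = test)
smear <- mean(exp(m2$residuals))   # Duan's smearing factor
exp(m2$fitted.values) * smear      # predicted salaries on the dollar scale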

dimitriy
  • I know you said this would work if I had covariates in the model, but to be clear: if I wanted to convert the impact of the IV perf, then for the Duan transformation I would do exp(m$coefficients) * mean(exp(m$residuals)), correct? – J.Sabree Mar 03 '21 at 03:41
  • I did the simple intercept-only case to simplify things and match your calculation. The second one should work. – dimitriy Mar 03 '21 at 03:48