In a log-linear model of an outcome $\ln y$ with a continuous untransformed explanatory variable $x$ and a dummy explanatory variable $d$:
- $100 \cdot \beta_x$ is approximately the percentage change in $y$ for a small change in $x$ (in either direction).
- If $d$ switches from 0 to 1, the percent change in $y$ is $100 \cdot [\exp(\beta_d) - 1]$.
- If $d$ switches from 1 to 0, the percent change in $y$ is $100 \cdot [\exp(-\beta_d) - 1]$.
Personally, I find this semi-elasticity interpretation much easier to follow than the alternative: a multiplicative effect on the geometric baseline mean (the exponentiated intercept) for the dummy variable, and the ratio $\frac{\mathbf{E}[y \mid x+1]}{\mathbf{E}[y \mid x]} = \exp(\beta_x)$ for the continuous one. If $y$ were itself a ratio, maybe that framing would make more sense.
For the graphs, you can plot two lines of re-transformed $y$ against $x$, one with $d=1$ and one with $d=0$:
\begin{equation}E[y \mid x, d] = \exp(\alpha + \beta_x \cdot x + \beta_d \cdot d) \cdot E[\exp(u)].\end{equation}
The second term in this expression is the hard part. If we assume normality and independence of the errors, we can approximate it with $\exp(\frac{\hat{\sigma}^2}{2})$, where we plug in the RMSE from the logged regression for the unobserved $\sigma$. Or we can make the weaker assumption that the $u_i$ are $iid$ and use the sample average of the exponentiated residuals from the logged model. That is Duan's "smearing" approach. It might make sense to take two averages, one for the $d=1$ observations and one for $d=0$, if you have reason to believe there is heteroskedasticity across the two groups; I sketch that variant after the levpredict check below.
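Here is a minimal sketch of the normal-theory version (Duan's factor is implemented by hand further below); the names lny, x, and d are placeholders rather than variables from the example that follows:
. reg lny x i.d
. predict double xb0, xb                         // fitted values on the log scale
. gen double yhat0 = exp(xb0)*exp(e(rmse)^2/2)   // exp(xb) times exp(sigma-hat^2/2)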
Finally, all this re-transformation nonsense can also be avoided by using a GLM with a log link, which models $E[y \mid x, d]$ directly.
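For instance, with the auto data used below, something like a Poisson GLM with robust standard errors (one common choice; a gamma family with a log link is another) would look like this:
. glm price i.foreign mpg, family(poisson) link(log) vce(robust)
. margins, dydx(foreign)   // discrete change in expected price, no re-transformation needed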
Here's an example using Stata:
. sysuse auto, clear
(1978 Automobile Data)
. gen lnp=ln(price)
. reg lnp i.foreign mpg
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   17.80
       Model |  3.74819416     2  1.87409708           Prob > F      =  0.0000
    Residual |  7.47533892    71  .105286464           R-squared     =  0.3340
-------------+------------------------------           Adj R-squared =  0.3152
       Total |  11.2235331    73  .153747029           Root MSE      =  .32448

------------------------------------------------------------------------------
         lnp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |
     Foreign |   .2824445   .0897634     3.15   0.002     .1034612    .4614277
         mpg |  -.0421151   .0071399    -5.90   0.000    -.0563517   -.0278785
       _cons |     9.4536   .1485422    63.64   0.000     9.157415    9.749785
------------------------------------------------------------------------------
The foreign price premium is about 33% and statistically significant:
. nlcom 100*(exp(_b[1.foreign])-1)
_nl_1: 100*(exp(_b[1.foreign])-1)
------------------------------------------------------------------------------
         lnp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _nl_1 |   32.63681   11.90594     2.74   0.006     9.301603    55.97202
------------------------------------------------------------------------------
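For the 1-to-0 switch from the last bullet above, you just flip the sign on the coefficient; since $\exp(-0.2824) = 1/1.326368 \approx 0.754$, the same calculation gives a discount of about 24.6%:
. nlcom 100*(exp(-_b[1.foreign])-1)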
Here are the exponentiated coefficients:
. reg, eform(b)
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   17.80
       Model |  3.74819416     2  1.87409708           Prob > F      =  0.0000
    Residual |  7.47533892    71  .105286464           R-squared     =  0.3340
-------------+------------------------------           Adj R-squared =  0.3152
       Total |  11.2235331    73  .153747029           Root MSE      =  .32448

------------------------------------------------------------------------------
         lnp |          b   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |
     Foreign |   1.326368   .1190594     3.15   0.002     1.109003    1.586337
         mpg |   .9587594   .0068455    -5.90   0.000     .9452067    .9725066
       _cons |      12754   1894.507    63.64   0.000     9484.509    17150.53
------------------------------------------------------------------------------
The foreign premium is just about identical. The geometric mean price for domestic cars seems pretty high to me, but that is because the exponentiated intercept corresponds to a domestic car with an mpg of zero, an extrapolation toward the gas guzzlers (the Caddies and Lincolns and Mercuries).
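To make that concrete, here is a sketch of the implied geometric mean for a domestic car evaluated at the sample average mpg instead of zero, which comes out to roughly $5,200:
. sum mpg, meanonly
. display exp(_b[_cons] + _b[mpg]*r(mean))
Now we implement Duan's re-transformation approach by hand: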
. predict double uhat, residual           // residuals on the log scale
. predict double lnyhat, xb               // fitted values of ln(price)
. gen double expuhat = exp(uhat)          // exponentiated residuals
. sum expuhat, meanonly                   // their sample mean is the smearing factor
. gen double yhat = r(mean)*exp(lnyhat)   // re-transformed prediction
You can also use Chris Baum's levpredict (from SSC):
. /* Make Sure I Did Things Right */
. levpredict yhat2, duan
. compare yhat yhat2
                                        ---------- difference ----------
                            count       minimum      average     maximum
------------------------------------------------------------------------
yhat=yhat2                     74
                       ----------
jointly defined                74             0            0           0
                       ----------
total                          74
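If you worry about the group-wise heteroskedasticity mentioned earlier, a hypothetical extension is to average the exponentiated residuals separately within each group:
. bysort foreign: egen double smear = mean(expuhat)   // group-specific smearing factors
. gen double yhat_grp = smear*exp(lnyhat)             // group factor times exp(xb)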
Now for the graph code:
. tw ///
> (line yhat mpg if foreign ==1, sort lcolor(green)) ///
> (line yhat mpg if foreign ==0, sort lcolor(orange)) ///
> (scatter price mpg if foreign==1, mcolor(green) msymbol(Oh) jitter(2)) ///
> (scatter price mpg if foreign==0, mcolor(orange) msymbol(Oh) jitter(2)) ///
> ,legend(label(1 "E[Price|Foreign]") label(2 "E[Price|Domestic]") label(3 "Foreign") label(4 "Domestic") rows(1)) ///
> ytitle("Dollars") title("Duan Smearing In Action") ///
> ylab(, angle(horizontal) format(%9.0fc)) plotregion(fcolor(white) lcolor(white)) graphregion(fcolor(white) lcolor(white))
Looks reasonable:
