
It seems common to apply standard linear regression to variables with long-tail distributions, like GDP, by first taking the log. What is the justification for doing that? Is it effectively assuming a lognormal distribution?

Also, if I have taken the log of an independent variable in a regression, how does that change the interpretation of $R^2$, p, etc?

kjetil b halvorsen
elplatt
  • Often the answer is as simple as it being absurd not to do that because otherwise the fit is all too sensitive to a few very large values. Plotting the data and the regression usually makes this clear. – Nick Cox Sep 25 '14 at 19:02
  • Long tails are a property of a distribution. GDP is a time series, hence a stochastic process, which is described by a family of distributions. Furthermore, it is usually found that GDP is a random walk, i.e. Brownian motion, which is certainly not long-tailed. – mpiktas Dec 18 '14 at 07:50
  • @mpiktas I'm asking about the distribution of GDPs over all nations at a specific time. – elplatt Dec 19 '14 at 19:10

2 Answers


You don't need to assume a lognormal distribution; linear regression does not require an independent variable to be normally distributed. The hope is that, after log-transforming the independent variable, the other conditions for interpreting linear regression results will be better met, such as residual errors that are normally distributed and independent of the fitted values.
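One way to see why the log transform can help is to compare how much a single observation can pull on the fit before and after the transform. The sketch below is only an illustration on made-up data: the lognormal parameters, the sample size, and the helper `max_leverage` are assumptions, not anything from the question. It computes the largest leverage (hat-matrix diagonal) for a simple regression on the raw predictor and on its log.

```python
import math
import random

def max_leverage(x):
    """Largest hat-matrix diagonal for a simple regression with intercept on x."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return max(1 / n + (xi - mean_x) ** 2 / sxx for xi in x)

rng = random.Random(0)
# Hypothetical right-skewed predictor standing in for GDP across countries.
gdp = [rng.lognormvariate(10, 1.5) for _ in range(500)]

lev_raw = max_leverage(gdp)
lev_log = max_leverage([math.log(g) for g in gdp])

print(f"max leverage, raw predictor:    {lev_raw:.3f}")
print(f"max leverage, logged predictor: {lev_log:.3f}")
```

A leverage close to 1 means the fitted line is largely determined by that one point. On skewed data like this, the raw predictor typically has a few such points, while the logged predictor does not.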

If the regression against the log-transformed independent variable meets those requirements, there are no problems with interpreting p-values, etc. Regression coefficients now represent the change in the dependent variable per unit change in the log of the independent variable. So if you use log10, the regression coefficient is the "change per 10-fold change in GDP" for your example; for log2, it is the "change per doubling of GDP."
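The effect of the log base on the coefficient can be checked with a small sketch. Everything here is hypothetical: the simulated data, the true slope of 2.0, and the helper `ols_slope_intercept` are assumptions made for illustration only. The same response is regressed on log10 and on log2 of the predictor, and the two slopes differ only by the constant factor log10(2).

```python
import math
import random

def ols_slope_intercept(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(0)
# Made-up data: a right-skewed predictor and a response that is linear
# in log10(gdp) plus Gaussian noise.
gdp = [rng.lognormvariate(10, 1.5) for _ in range(500)]
y = [3.0 + 2.0 * math.log10(g) + rng.gauss(0, 0.5) for g in gdp]

_, slope_log10 = ols_slope_intercept([math.log10(g) for g in gdp], y)
_, slope_log2 = ols_slope_intercept([math.log2(g) for g in gdp], y)

print(f"change in y per 10-fold change in GDP: {slope_log10:.3f}")
print(f"change in y per doubling of GDP:       {slope_log2:.3f}")
# Same model, different units for the slope.
print(f"slope_log10 * log10(2):                {slope_log10 * math.log10(2):.3f}")
```

The fitted values, residuals, $R^2$, and p-values are the same for either base; only the scale of the coefficient changes.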

EdM
  • Even normally distributed errors aren't a *requirement*; normality is just nice for regression if you have it. Even if you want to add it to the list of assumptions, it's the least important assumption. – Nick Cox Sep 25 '14 at 19:00
  • Granted. For those reading this page, http://stats.stackexchange.com/questions/16381/what-is-a-complete-list-of-the-usual-assumptions-for-linear-regression?lq=1 provides what may be the most complete and highly viewed reference on this website to the assumptions. – EdM Sep 25 '14 at 19:22
  • More thorough discussion of when and why to use log transforms can be found at http://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va – EdM Sep 26 '14 at 17:08

long-tail distributions, like GDP,

Why do you think GDP has a long-tailed distribution? This is not common knowledge, as far as I know.

by first taking the log. What is the justification for doing that? Is it effectively assuming a lognormal distribution?

Yes, we sometimes assume a lognormal distribution. If your log-transformed variable is normal, then the original variable is lognormal, but you don't have to assume a normal distribution.

It's still debated whether GDP is a unit-root process or trend-stationary, though. For instance, some may think that $\ln GDP_i=A+Bi+\varepsilon_i$, i.e. trend-stationary.

The log transform is most commonly used for two reasons.

One is when you see that the variance of the series increases at higher levels. In this case, log or Box-Cox transformations are often applied.

The second reason is when you think that your series has a constant growth rate, as in geometric Brownian motion: $GDP_i=GDP_{i-1}e^{r+\varepsilon_i}$, where $r$ is the growth rate and $\varepsilon_i$ is a random shock. If you take the log, this turns into a nice linear equation: $\ln GDP_i-\ln GDP_{i-1}=r+\varepsilon_i$.
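Both reasons can be illustrated with a short simulation. This is a sketch under stated assumptions, not an analysis of real GDP data: the growth rate `r`, the shock size `sigma`, the starting level, and the series length are all made up. The series grows by a random factor each period; the raw differences spread out as the level rises, while the log differences stay stable and average out to the growth rate.

```python
import math
import random
import statistics

rng = random.Random(1)
r = 0.03       # assumed constant growth rate per period (hypothetical)
sigma = 0.02   # assumed standard deviation of the shocks (hypothetical)

# Simulate a series whose log increments are r plus a Gaussian shock.
gdp = [100.0]
for _ in range(2000):
    gdp.append(gdp[-1] * math.exp(r + rng.gauss(0, sigma)))

raw_diffs = [b - a for a, b in zip(gdp, gdp[1:])]
log_diffs = [math.log(b) - math.log(a) for a, b in zip(gdp, gdp[1:])]

# Spread of the differences early and late in the series.
raw_early = statistics.pstdev(raw_diffs[:100])
raw_late = statistics.pstdev(raw_diffs[-100:])
log_early = statistics.pstdev(log_diffs[:100])
log_late = statistics.pstdev(log_diffs[-100:])

r_hat = statistics.fmean(log_diffs)

print(f"raw differences, spread early vs late: {raw_early:.3g} vs {raw_late:.3g}")
print(f"log differences, spread early vs late: {log_early:.3g} vs {log_late:.3g}")
print(f"estimated growth rate: {r_hat:.4f} (value used in simulation: {r})")
```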

Also, if I have taken the log of an independent variable in a regression, how does that change the interpretation of $R^2$, p, etc?

$R^2$ would measure the fit to the log, i.e. your dependent variable would be $\ln\frac{GDP_i}{GDP_{i-1}}$, and all your OLS diagnostics would refer to this variable.

Take a look at this paper: "The Role of the Log Transformation in Forecasting Economic Variables" by Helmut Luetkepohl and Fang Xu.

Aksakal