First let's see what typically happens when we take logs of something that's right skew.
The top row contains histograms for samples from three different, increasingly skewed distributions.
The bottom row contains histograms for their logs.

You can see that the center case ($y$) has been transformed to symmetry, while the more mildly right skew case ($x$) is now somewhat left skew. On the other hand, the most skew variable ($z$) is still (slightly) right skew, even after taking logs.
If we wanted our distributions to look more normal, the transformation definitely improved the second and third cases. So we can see that this sort of transformation might help.
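The figure isn't reproduced here, but here's a minimal sketch that generates three samples with that qualitative behaviour. The post doesn't say which distributions were used; a gamma, a lognormal and a Pareto (my own choice, not necessarily the original ones, and not tuned to match the means and medians quoted further down) show the same pattern:

```python
# Three increasingly right-skew positive samples, and their logs.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 1000

x = rng.gamma(shape=4.0, scale=40.0, size=n)     # mildly right skew; its log ends up left skew
y = rng.lognormal(mean=5.0, sigma=0.6, size=n)   # its log is exactly normal
z = 100 * (rng.pareto(a=3.0, size=n) + 1.0)      # heavy right tail; its log is still right skew

fig, axes = plt.subplots(2, 3, figsize=(10, 6))
for j, (name, v) in enumerate([("x", x), ("y", y), ("z", z)]):
    axes[0, j].hist(v, bins=40)
    axes[0, j].set_title(name)
    axes[1, j].hist(np.log(v), bins=40)
    axes[1, j].set_title(f"log({name})")
plt.tight_layout()
plt.show()
```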
So why does it work?
Note that when we're looking at a picture of the distributional shape, we're not considering the mean or the standard deviation - those just affect the labels on the axes.
So we can imagine looking at some kind of "standardized" variables (all still positive, but with similar location and spread, say).
Taking logs "pulls in" more extreme values on the right (high values) relative to the median, while values at the far left (low values) tend to get stretched back, further away from the median.

In the first diagram, $x$, $y$ and $z$ all have means near 178, all have medians close to 150, and their logs all have medians near 5.
When we look at the original data, a value at the far right - say around 750 - is sitting far above the median. In the case of $y$, it's 5 interquartile ranges above the median.
But when we take logs, it gets pulled back toward the median; after taking logs it's only about 2 interquartile ranges above the median.
Meanwhile a low value like 30 (only 4 values in the sample of size 1000 are below it) is a bit less than one interquartile range below the median of $y$. When we take logs, it's again about two interquartile ranges below the new median.

It's no accident that the ratios 750/150 and 150/30 are both 5, and that log(750) and log(30) ended up about the same distance away from the median of log(y). That's how logs work - converting constant ratios to constant differences.
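Spelling out that arithmetic (natural logs, using the numbers from the example above):

$$\log(750) - \log(150) \;=\; \log\!\left(\tfrac{750}{150}\right) \;=\; \log 5 \;\approx\; 1.61 \;=\; \log\!\left(\tfrac{150}{30}\right) \;=\; \log(150) - \log(30).$$

Equal ratios on the original scale become equal distances on the log scale.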
It's not always the case that the log will help noticeably. For example, if you take, say, a lognormal random variable and shift it substantially to the right (i.e. add a large constant to it) so that the mean becomes large relative to the standard deviation, then taking the log of that would make very little difference to the shape. It would be less skew - but barely.
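A quick numerical check of that claim (my own illustration, with an arbitrary lognormal and an arbitrary shift):

```python
# Shift a lognormal far to the right; taking logs then barely changes its shape.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
w = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)
shifted = w + 50.0                # mean is now large relative to the sd

print(skew(shifted))              # about 1.7: clearly right skew
print(skew(np.log(shifted)))      # very nearly the same: the log barely helps
```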
But other transformations - the square root, say - will also pull large values in like that. Why are logs, in particular, more popular?
I touched on one reason just at the end of the previous part - constant ratios become constant differences. This makes logs relatively easy to interpret, since constant percentage changes (like a 20% increase to every one of a set of numbers) become a constant shift. So a change of $-0.162$ in the natural log is a 15% decrease in the original numbers, no matter how big the original number is.
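To see where that number comes from:

$$\ln(0.85\,x) \;=\; \ln x + \ln 0.85 \;\approx\; \ln x - 0.1625,$$

so multiplying any value $x$ by $0.85$ (a 15% decrease) subtracts roughly the same $0.16$ from its natural log, regardless of how large $x$ is.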
A lot of economic and financial data behaves like this, for example (constant or near-constant effects on the percentage scale). The log scale makes a lot of sense in that case. Moreover, as a result of that percentage-scale effect, the spread of values tends to be larger as the mean increases - and taking logs also tends to stabilize the spread. That's usually more important than normality. Indeed, all three distributions in the original diagram come from families where the standard deviation increases with the mean, and in each case taking logs stabilizes the variance. [This doesn't happen with all right skewed data, though. It's just very common in the sort of data that crops up in particular application areas.]
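Here's an illustrative check of that spread-stabilizing effect (again my own sketch, with lognormal data at a few different scales, not the distributions from the figure):

```python
# The sd grows with the mean on the raw scale, but is roughly constant on the log scale.
import numpy as np

rng = np.random.default_rng(7)
for scale in (1.0, 10.0, 100.0):
    v = scale * rng.lognormal(mean=0.0, sigma=0.6, size=50_000)
    print(f"mean={v.mean():8.1f}  sd={v.std():7.2f}  sd of logs={np.log(v).std():.3f}")
```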
There are also times when the square root will make things more symmetric, but it tends to happen with less skewed distributions than I use in my examples here.
We could (fairly easily) construct another set of three more mildly right-skew examples, where the square root made one left skew, one symmetric and the third was still right-skew (but a bit less skew than before).
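One such construction (a sketch under my own choice of starting distribution): raise a positive, symmetric variable to the powers 1.5, 2 and 3. All three results are right skew, and their square roots are respectively slightly left skew, symmetric and still right skew:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
s = rng.uniform(1.0, 3.0, size=100_000)   # positive and symmetric

for p in (1.5, 2.0, 3.0):
    v = s**p                              # mildly right skew for each p > 1
    print(f"power {p}: skew = {skew(v):+.2f}, skew after sqrt = {skew(np.sqrt(v)):+.2f}")
```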
What about left-skewed distributions?
If you apply the log transformation to a symmetric distribution, it will tend to make it left skew for the same reason it often makes a right skew one more symmetric - see the related discussion here.
Correspondingly, if you apply the log-transformation to something that's already left skew, it will tend to make it even more left skew, pulling the things above the median in even more tightly, and stretching things below the median down even harder.
So the log transformation wouldn't be helpful then.
See also power transformations/Tukey's ladder. Distributions that are left skew may be made more symmetric by taking a power greater than 1 (squaring, say), or by exponentiating. If there's an obvious upper bound, one might subtract the observations from the upper bound (giving a right skewed result) and then attempt to transform that.
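As a small illustration of those options (my own example, with a bounded left-skew variable; nothing here is from the original post):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(11)
v = rng.beta(a=5.0, b=1.5, size=100_000)   # left skew, bounded above by 1

print(skew(v))           # negative: left skew
print(skew(np.log(v)))   # even more negative: the log makes it worse
print(skew(v**2))        # closer to zero: a power above 1 reduces the left skew
print(skew(1.0 - v))     # reflect off the upper bound: now right skew, so the
                         # earlier tools for right-skew data can be applied
```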