I have a highly right-skewed data set with a large range of values (from about 1 to 10^6). I can't share the actual data for work-related reasons.

When I plot the log of the data instead, the distribution looks a lot more like a normal distribution.

Have I stumbled on a meaningful insight in the data set, or is it just a general property of the log transform that it brings a distribution closer to normal?

Akaike's Children
  • I always naively assumed that the log transform works well if your data can be thought of as some constant, times many (more or less) independent factors close to 1. E.g. a guy's salary is 10% above the mean if he has a degree, 5% higher if he's living in a large town, 5% lower if he has health issues... A log transform turns that into a sum of independent small numbers, so you get a normal distribution. – nikie Mar 23 '19 at 12:49
  • @Akaikes See [here](https://stats.stackexchange.com/a/67505/805), [here](https://stats.stackexchange.com/a/87537/805) and particularly [here](https://stats.stackexchange.com/a/278056/805) & [here](https://stats.stackexchange.com/a/107690/805) which indicate that the log-transform won't always make even a right-skewed variate less skew (in absolute terms) than it was. A simple counterexample is the [Maxwell(-Boltzmann) distribution](https://en.wikipedia.org/wiki/Maxwell%E2%80%93Boltzmann_distribution), which is mildly right skew but the log of a Maxwell-variate is more strongly (left) skew. – Glen_b Mar 24 '19 at 02:20
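The two comments above can be checked with a small simulation. This is a minimal sketch using only the Python standard library; the base value, the number of factors, the factor range, the sample size, and the seed are all illustrative assumptions rather than anything from the thread. It relies on the fact that a Maxwell variate with unit scale is the length of a three-dimensional standard normal vector.

```python
import math
import random


def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 20_000              # sample size (assumption)


def product_of_factors(base=1000.0, n_factors=50):
    """A constant times many independent factors close to 1 (nikie's model).

    The base, the number of factors and the range 0.8..1.25 are
    hypothetical values chosen only for illustration.
    """
    value = base
    for _ in range(n_factors):
        value *= rng.uniform(0.8, 1.25)
    return value


def maxwell_variate():
    """Unit-scale Maxwell variate: length of a 3-D standard normal vector."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3)))


products = [product_of_factors() for _ in range(N)]
skew_product = skewness(products)
skew_log_product = skewness([math.log(v) for v in products])

maxwell = [maxwell_variate() for _ in range(N)]
skew_maxwell = skewness(maxwell)
skew_log_maxwell = skewness([math.log(v) for v in maxwell])

print(f"product of factors: raw {skew_product:.2f}, log {skew_log_product:.2f}")
print(f"Maxwell:            raw {skew_maxwell:.2f}, log {skew_log_maxwell:.2f}")
```

Under these assumptions the product should be strongly right-skewed with a nearly symmetric log, while the Maxwell sample should be mildly right-skewed with a log that is more strongly left-skewed, which is the pattern the comments describe.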

1 Answer

For strictly positive quantities, a log transformation is indeed the standard first transformation to try and is very frequently used. It is also used in regression when you want a multiplicative interpretation of the coefficients (e.g. the effect of doubling or halving blood cholesterol).

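Of course, it will not always make a distribution more normal. For example, take samples from an N(1000, 1) distribution: the data are already normal, so no transformation can make them more normal, and a nonlinear one such as the log can only make them less so.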
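The N(1000, 1) example is easy to try out. The following is a minimal sketch using only the Python standard library, with the sample size and seed chosen arbitrarily for illustration. Over a range as narrow as 1000 plus or minus a few units the log is almost linear, so the skewness should be close to zero before and after the transform, with nothing gained.

```python
import math
import random


def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


rng = random.Random(1)  # fixed seed (assumption) so the run is repeatable

# Draw from N(1000, 1); every value is far from zero, so log() is safe.
sample = [rng.gauss(1000.0, 1.0) for _ in range(20_000)]

skew_raw = skewness(sample)
skew_log = skewness([math.log(x) for x in sample])

print(f"skewness of raw data: {skew_raw:.4f}")
print(f"skewness of log data: {skew_log:.4f}")
```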

Nick Cox
Björn
  • Similarly, a distribution that is symmetric or left-skewed will have its skewness made worse by a logarithmic transformation. Consider the not very magnificent seven 1 2 3 4 5 6 7: their square roots are left-skewed, and their logarithms are even more left-skewed. – Nick Cox Mar 23 '19 at 09:04
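The seven-value example can be verified directly. This is a small sketch using the moment-based skewness (third central moment divided by the 1.5 power of the second), which is one common definition and an assumption here, since the comment does not say which measure is meant.

```python
import math


def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


values = [1, 2, 3, 4, 5, 6, 7]

skew_raw = skewness(values)                          # symmetric data
skew_sqrt = skewness([math.sqrt(v) for v in values])  # after square root
skew_log = skewness([math.log(v) for v in values])    # after logarithm

print(f"raw:  {skew_raw:.3f}")
print(f"sqrt: {skew_sqrt:.3f}")
print(f"log:  {skew_log:.3f}")
```

The raw values are symmetric, so their skewness is zero; the square roots come out negative, and the logarithms more negative still.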