10

Books and discussions often state that when facing problems (of which there are a few) with a predictor, log-transforming it is a possibility. Now, I understand that this depends on distributions, and that normality of predictors is not an assumption of regression; but log transforming does make data more uniform, less affected by outliers and so on.

I thought about log transforming all my continuous variables which are not of main interest, i.e. variables I only adjust for.

Is that wrong? Good? Useless?

Glen_b
Adam Robinsson

3 Answers

25

Now, I understand that this depends on distributions and normality in predictors

log transforming does make data more uniform

As a general claim, this is false --- but even if it were the case, why would uniformity be important?

Consider, for example,

i) a binary predictor taking only the values 1 and 2. Taking logs would leave it as a binary predictor taking only the values 0 and log 2. It doesn't really affect anything except the intercept and scaling of terms involving this predictor. Even the p-value of the predictor would be unchanged, as would the fitted values.
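A minimal sketch of case (i), using simulated data (the coefficients 3.0 and 1.5, the noise level, and the helper `ols` are illustrative assumptions, not part of the original answer). It fits a simple least-squares line on the raw and on the logged binary predictor and compares the fitted values:

```python
import math
import random

def ols(x, y):
    """Simple least-squares fit y = a + b*x; returns (a, b, fitted values)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b, [a + b * xi for xi in x]

random.seed(0)
# Hypothetical binary predictor coded 1/2 and a made-up response.
x = [random.choice([1, 2]) for _ in range(200)]
y = [3.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]
log_x = [math.log(xi) for xi in x]  # takes only the values 0 and log 2

a_raw, b_raw, fit_raw = ols(x, y)
a_log, b_log, fit_log = ols(log_x, y)

# Largest difference between the two sets of fitted values.
max_diff = max(abs(f1 - f2) for f1, f2 in zip(fit_raw, fit_log))
print("raw fit:", a_raw, b_raw)
print("log fit:", a_log, b_log)
print("max difference in fitted values:", max_diff)
```

The intercept and slope differ between the two fits (the slope is rescaled by a factor of log 2), but the fitted values agree up to floating-point error.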


ii) consider a left-skew predictor. Now take logs. It typically becomes more left skew.


iii) uniform data becomes left skew


(it's often not so extreme a change, though)
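Cases (ii) and (iii) can be checked numerically with a rough sketch like the one below. The choice of a Beta(5, 1) sample as the left-skew predictor and of a uniform sample on [1, 10] is an assumption made for illustration only; other distributions will give different magnitudes.

```python
import math
import random

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(1)
n = 5000
# Assumed examples: a left-skew positive predictor and a uniform one.
left_skew = [random.betavariate(5, 1) for _ in range(n)]
uniform = [random.uniform(1, 10) for _ in range(n)]

skew_left_raw = skewness(left_skew)
skew_left_log = skewness([math.log(v) for v in left_skew])
skew_unif_raw = skewness(uniform)
skew_unif_log = skewness([math.log(v) for v in uniform])

print("left-skew predictor: raw", skew_left_raw, "log", skew_left_log)
print("uniform predictor:   raw", skew_unif_raw, "log", skew_unif_log)
```

In both cases the skewness after taking logs should come out more negative than before.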

less affected by outliers

As a general claim, this is false. Consider low outliers in a predictor.
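As a rough illustration with made-up numbers (the range 50 to 150 and the low value 0.01 are assumptions), the sketch below uses a z-score as a crude measure of how far a low outlier sits from the rest of the predictor, before and after taking logs:

```python
import math
import random

def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

random.seed(2)
x = [random.uniform(50, 150) for _ in range(100)]
x.append(0.01)  # hypothetical low outlier, appended last

z_raw = z_scores(x)[-1]
z_log = z_scores([math.log(v) for v in x])[-1]

print("z-score of the low point, raw scale:", z_raw)
print("z-score of the low point, log scale:", z_log)
```

On the log scale the low point ends up much further from the bulk of the data, so it is more extreme, not less.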


I thought about log transforming all my continuous variables which are not of main interest

To what end? If originally the relationships were linear, they would no longer be.


And if they were already curved, doing this automatically might make them worse (more curved), not better.
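A small simulation makes the loss of linearity concrete. This is only a sketch under assumed values (a true line y = 2 + 3x, noise with standard deviation 5, x uniform on [1, 100]); `residual_ss` is a helper written for this example.

```python
import math
import random

def residual_ss(x, y):
    """Residual sum of squares from a simple least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

random.seed(3)
# Assumed setup: the response really is linear in x.
x = [random.uniform(1, 100) for _ in range(300)]
y = [2.0 + 3.0 * xi + random.gauss(0, 5) for xi in x]

rss_raw = residual_ss(x, y)                          # straight line in x
rss_log = residual_ss([math.log(xi) for xi in x], y)  # straight line in log(x)

print("residual SS, x as is:", rss_raw)
print("residual SS, log(x): ", rss_log)
```

The straight-line fit in log(x) leaves a much larger residual sum of squares, because the relationship between y and log(x) is now curved.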

--

Taking logs of a predictor (whether of primary interest or not) might sometimes be suitable, but it's not always so.

Glen_b
  • 3
    Many thanks for this splendid answer. I think that many of us, at least me, needed to see it visualized this way. But do you also agree that right-skewed data should be subject to log-transforming? More than other skews and forms? – Adam Robinsson Dec 25 '14 at 10:28
  • 1
    Not generally, no. Under some very specific conditions, perhaps. – Glen_b Dec 25 '14 at 14:40
  • 1
    I'm also surprised to see that no one mentioned interpretability of the model. If you log transform your dependent variable, it becomes a bit more difficult to interpret the model, especially for laymen or those without statistical/mathematical backgrounds. For example, let's say you had a model that predicted the height of a tree in feet given the circumference of the trunk in inches. The interpretation of $\hat{\beta}=0.50$ as "for a one-inch increase in the circumference, the mean height of the tree is increased by the log of half a foot" is more cumbersome (continued) – StatsStudent Oct 18 '15 at 07:15
  • (con't) than being able to say, for example, "for a one-inch increase in trunk circumference, the mean height of the tree is increased by half a foot". The latter is easier to interpret and easier to calculate in the field without a calculator. – StatsStudent Oct 18 '15 at 07:17
10

In my opinion, it doesn't make sense to perform a log transformation (or any data transformation, for that matter) just for the sake of it. As previous answers mentioned, depending on the data, some transformations would be either invalid or useless. I highly recommend reading the following introductory material on data transformation, which I find excellent: http://fmwww.bc.edu/repec/bocode/t/transint.html. Please note that the code examples in that document are written in Stata, but the document is otherwise generic enough to be useful to non-Stata users as well.

Some simple techniques and tools for dealing with common data-related problems, such as lack of normality, outliers and mixture distributions can be found in this article (note, that stratification as an approach to dealing with mixture distribution is most likely the simplest one - a more general and complex approach to this is mixture analysis, also known as finite mixture models, a description of which is beyond the scope of this answer). Box-Cox transformation, briefly mentioned in the two references above, is a rather important data transformation, especially for non-normal data (with some caveats). For more details on Box-Cox transformation, please see this introductory article.
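For orientation, the usual one-parameter form of the Box-Cox transformation is given below. It is stated here from general knowledge rather than taken from the linked references, and it applies to strictly positive $y$:

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[1ex]
\log y, & \lambda = 0.
\end{cases}
$$

The log transformation is thus the special case $\lambda = 0$, while $\lambda = 1$ leaves the data unchanged apart from a shift.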

Aleksandr Blekh
8

Log transforming does not ALWAYS make things better. Obviously, you can't log-transform variables that achieve zero or negative values, and even positive ones that hug zero could come out with negative outliers if log-transformed.
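A small guard along these lines can catch the problem early. This is just a sketch; `log_transform` is a hypothetical helper, not a function from any library.

```python
import math

def log_transform(values):
    """Return natural logs; raise ValueError if any value is not strictly positive."""
    bad = [v for v in values if v <= 0]
    if bad:
        raise ValueError(f"{len(bad)} non-positive value(s); log is undefined for them")
    return [math.log(v) for v in values]

# A positive value that hugs zero becomes a large negative value after logging.
near_zero = log_transform([0.001, 2.0, 3.0, 5.0])
print(near_zero)

# Zeros (or negatives) are rejected rather than silently mishandled.
try:
    log_transform([0.0, 1.0, 2.0])
except ValueError as err:
    print("refused:", err)
```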

You should not just routinely log everything, but it is a good practice to THINK about transforming selected positive predictors (suitably, often a log but maybe something else) before fitting a model. The same goes for the response variable. Subject-matter knowledge is important too. Some theory from physics or sociology or whatever might naturally lead to certain transformations. Generally, if you see variables that are positively skewed, that's where a log (or maybe a square root or a reciprocal) might help.

Some regression texts seem to suggest that you have to look at diagnostic plots before considering any transformations, but I disagree. I think it's better to do the best job you can at making these choices before fitting any models, so that you have the best starting point possible; then look at diagnostics to see if you need to adjust from there.

Russ Lenth
  • I'll add that these considerations apply to both important and unimportant predictors. – Russ Lenth Dec 25 '14 at 00:23
  • Thanks @rvl! I'm always confused by the discordance between when and how to choose transformations; books often state that, as you wrote, one needs to check the form of all variables before touching regression. Thanks for providing your insights. – Adam Robinsson Dec 25 '14 at 10:38
  • @rvl, thank you for your answer. Would you log-transform the `snoq` dataset in this [CrossValidated thread](http://stats.stackexchange.com/questions/136144/mixture-of-gaussians-on-log-of-data) (bearing in mind the goal is to fit a mixture of Gaussians)? – Zhubarb Feb 04 '15 at 13:11