Change variable to log transformed or keep original?

Question

Log-transforming the dependent variable is sometimes recommended when the residuals of a fitted linear regression model are not normally distributed. What is the proper way to evaluate whether the log-transformed variable, rather than the original one, should be used in further modeling on the same data?

What are you trying to model? It may be better to use a different distribution to model your data, e.g. via a generalized linear model instead of a linear model/ANOVA. — Stefan, Feb 15 '19 at 22:08

score 3 · Answer 1 · answered Feb 16 '19 at 06:25

Probably the easiest approach is to plot the distribution of your response (or, better, the residuals) and check whether it looks closer to Gaussian on the original scale or on the log scale.

For example, consider time-to-event data where $\log Y \sim \mathcal{N}(\mu, \sigma^2)$: the distribution of $Y$ is right-skewed on the original scale, while $\log Y$ is Gaussian.
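As a rough numerical companion to eyeballing plots, the sketch below fits a straight line to simulated data on both scales and compares how close each set of residuals is to a straight line on a normal Q-Q plot. Everything here is an assumption made for illustration: the simulated data, the coefficients, and the helper names `ols_residuals` and `qq_correlation` are hypothetical, and the Q-Q correlation is only one of several possible summaries. It uses only the Python standard library.

```python
import math
import random
from statistics import NormalDist, mean


def ols_residuals(x, y):
    """Residuals from a simple least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def qq_correlation(values):
    """Correlation between sorted values and standard normal quantiles.

    Values close to 1 mean the normal Q-Q plot is close to a straight line.
    """
    n = len(values)
    ordered = sorted(values)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n avoid infinite quantiles at 0 and 1.
    theo = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    mo, mt = mean(ordered), mean(theo)
    cov = sum((o - mo) * (t - mt) for o, t in zip(ordered, theo))
    var_o = sum((o - mo) ** 2 for o in ordered)
    var_t = sum((t - mt) ** 2 for t in theo)
    return cov / math.sqrt(var_o * var_t)


# Simulated data (an assumption): log(y) is linear in x with Gaussian noise.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(500)]
y = [math.exp(1.0 + 0.2 * xi + rng.gauss(0, 0.5)) for xi in x]

r_raw = qq_correlation(ols_residuals(x, y))
r_log = qq_correlation(ols_residuals(x, [math.log(yi) for yi in y]))
print(f"Q-Q correlation, original scale: {r_raw:.3f}")
print(f"Q-Q correlation, log scale:      {r_log:.3f}")
```

A higher Q-Q correlation on the log scale suggests the log-scale residuals are closer to Gaussian, but this is a descriptive check rather than a formal test, and it says nothing about whether the log scale is substantively appropriate.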

Tony Duan

Any advice on how to determine the relative residual normality of the original data vs the log data? – ReneBt Feb 16 '19 at 06:34
Hoping that any transformation makes data look normal is usually asking for too much. Indeed, it's an unusual statistical procedure that absolutely requires the data to be very close to normally distributed. For a thread that discusses what transformation of data is trying to achieve, please see https://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va/3530#3530. – whuber Feb 16 '19 at 20:18
@whuber So let's say a log transformation seemed reasonable according to the guidelines in the post you linked to. What is the proper way to evaluate the transformation in order to decide whether it should replace the original untransformed variable in the model? – user31527 Feb 17 '19 at 11:48
@user31527 That would require a lengthy book to answer properly. Some good introductory resources from an exploratory perspective are Tukey's *EDA* and Hoaglin *et al.*, *Understanding Robust and Exploratory Data Analysis*. Other approaches include cross-validation and goodness-of-fit tests, depending on the situation, the purposes, and the assumptions. – whuber Feb 17 '19 at 16:03

score 0 · Accepted Answer · answered Feb 17 '19 at 13:36

My preference is to transform data only for substantive reasons, not statistical ones.

If the assumptions of one model are violated, use a different model. It used to be that you more or less had to use linear regression because other methods either had not been developed or were intractable without powerful computers. That is no longer true.

Consider quantile regression and robust regression, for starters.
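To make the idea of switching models concrete, here is a minimal sketch of one simple robust regression method, the Theil-Sen estimator (median of pairwise slopes), compared with ordinary least squares on contaminated data. This is an illustrative choice, not necessarily the method the answer has in mind; the simulated data, the contamination pattern, and the helper names `ols_fit` and `theil_sen_fit` are all assumptions. It uses only the Python standard library.

```python
import random
from itertools import combinations
from statistics import mean, median


def ols_fit(x, y):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def theil_sen_fit(x, y):
    """Theil-Sen estimator: median of all pairwise slopes,
    then median of the implied intercepts."""
    points = list(zip(x, y))
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1
    ]
    slope = median(slopes)
    intercept = median([yi - slope * xi for xi, yi in points])
    return intercept, slope


# Simulated data (an assumption): true line y = 2 + 0.5*x with Gaussian noise,
# plus six gross outliers at the largest x values.
rng = random.Random(1)
x = [float(i) for i in range(60)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
for i in range(54, 60):
    y[i] += 40.0

ols_intercept, ols_slope = ols_fit(x, y)
ts_intercept, ts_slope = theil_sen_fit(x, y)
print(f"OLS slope:       {ols_slope:.3f}")
print(f"Theil-Sen slope: {ts_slope:.3f}  (true slope is 0.5)")
```

In this setup the least-squares slope is pulled toward the outliers while the median-based slope stays near the true value. For real analyses, libraries such as statsmodels provide quantile regression (`QuantReg`) and M-estimator robust regression (`RLM`), which handle multiple predictors and give standard errors.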
