How reliable is a linear model on log-transformed data?

Question

I have collected timing data for which the residuals are not normally distributed. I log-transformed the data and then fitted a linear mixed model. (The residuals from the log-transformed data are much closer to normal, but still not normal.) The results show a significant difference between two conditions, which is what I was hoping for. However, without the log transformation, the difference is no longer significant. In addition, if I do not log-transform the data but instead remove outliers (which then results in normally distributed data), the difference is also no longer significant.

DV: time (raw) or log-transformed time
IVs: categorical and numerical
Random effects: per person (6 measurements per person)

I’m hoping to get the stats community's opinion as I'm not a statistician. Would you rely on the log-transformation? Or would you rather use a non-parametric method? Or, third idea, use the linear regression on non-normally distributed data and later check residuals?

Residuals from raw data

Residuals from log-transformed data

The thing with log-transforming data (in order to correct the distribution of the errors) is that it changes not only the random error but also the deterministic part (the model of the mean): $$ y = \beta X + \epsilon $$ is a different model from $$ \log(y) = \beta X + \epsilon $$ When you use a GLM (generalized linear model), you can treat the deterministic and random parts independently. – Sextus Empiricus, Sep 12 '19 at 10:08
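
To make the difference between the two models in this comment concrete, here is a short worked comparison. It assumes (an added assumption, not something stated in the comment) that the errors of the log-scale model are normal with variance $\sigma^2$:

$$
\begin{aligned}
\text{linear model on } \log(y):\quad & \log(y) = \beta X + \epsilon,\ \epsilon \sim N(0, \sigma^2)
  \;\Rightarrow\; \operatorname{median}(y \mid X) = e^{\beta X},\quad
  E[y \mid X] = e^{\beta X + \sigma^2/2} \\
\text{GLM with log link:}\quad & \log E[y \mid X] = \beta X
  \;\Rightarrow\; E[y \mid X] = e^{\beta X}
\end{aligned}
$$

Under that assumption, the coefficients of the log-transformed model describe multiplicative changes in the median (geometric mean) of $y$, while a log-link GLM describes multiplicative changes in its arithmetic mean. Neither is the same question as the additive difference in means tested on the raw scale.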

score 5 · Answer 1 · mkt · 2019-09-11T20:04:40.710

1) You do not need the raw data to be normally distributed. It's only the residuals that need to be.

2) Removing 'outliers' is generally a bad idea unless you have very good reason to believe that those data points are invalid for some reason, such as instrument failure.

3) If the residual distribution is actually a problem, you can still avoid log transformation by changing the assumed distribution from Gaussian to something else using a generalized linear mixed model. Transformation might be fine, though.

4) ~~To address the question in your title, which is a bit different from the text of your question: the validity of the data has nothing to do with whether it is transformed or not.~~ [Removed after title changed]
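
As a rough illustration of why the routes in point 3 can disagree with a raw-scale analysis, the sketch below compares what each one estimates in the simplest possible case: one two-level predictor and no random effects. The timing values and variable names are invented for illustration. The shortcut used for the GLM (with a single categorical predictor, the fitted group means equal the arithmetic group means) holds for that simple design only, not for a full mixed model.

```python
import math
from statistics import fmean, geometric_mean

# Hypothetical timing data (seconds) for two conditions; values are invented.
# Condition A contains one very long response.
cond_a = [1.2, 1.5, 1.1, 1.8, 9.0]
cond_b = [1.9, 2.4, 2.1, 2.8, 3.0]

# Linear model on log(time): the condition effect is the difference of mean logs,
# which is the log of the ratio of geometric means.
effect_log_lm = fmean([math.log(t) for t in cond_b]) - fmean([math.log(t) for t in cond_a])

# Log-link GLM with one categorical predictor: the fitted group means are the
# arithmetic means, so the effect is the log of their ratio.
effect_log_glm = math.log(fmean(cond_b) / fmean(cond_a))

# Untransformed linear model: the effect is the raw difference in arithmetic means.
effect_raw = fmean(cond_b) - fmean(cond_a)

print(f"log-transformed LM effect: {effect_log_lm:+.3f}")
print(f"log-link GLM effect:       {effect_log_glm:+.3f}")
print(f"raw-scale LM effect:       {effect_raw:+.3f}")
print(f"ratio of geometric means:  {geometric_mean(cond_b) / geometric_mean(cond_a):.3f}")
```

With these invented numbers the effect on the log scale is positive while both mean-based effects are negative, because the single long response dominates the arithmetic mean of condition A. That is one way a result can be significant after log transformation and not before, without either analysis being "wrong": they compare different summaries of the data.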

edited Sep 11 '19 at 20:04

answered Sep 11 '19 at 19:31

mkt

Thank you! The residuals are non-normally distributed for time elapsed, and they're almost normally distributed for the log-transformed data. I'm going to edit my question. – Amanda Sep 11 '19 at 19:42
As a follow up question, it seems like per 3), you think transformation isn't optimal. Do you think fitting a different model is better than transforming data? – Amanda Sep 11 '19 at 19:50
@Amanda Checking residual distributions for mixed models is a bit more complex. See here: https://stats.stackexchange.com/q/77891/121522 – mkt Sep 11 '19 at 19:56
@Amanda Transformation can frequently be fine (I do it all the time). But there are exceptions. This thread and the linked threads in the comments are worth reading. https://stats.stackexchange.com/questions/18844/when-and-why-should-you-take-the-log-of-a-distribution-of-numbers – mkt Sep 11 '19 at 19:58

score 1 · Answer 2 · answered Sep 12 '19 at 11:58

You ask:

Would you rely on the log-transformation? Or would you rather use a non-parametric method?

For me, that would depend on what the variables are and what my question is. Does taking the log make substantive sense? It often does for variables involving money (such as income, wealth, expenditures) because we tend to think of those variables multiplicatively rather than additively. That is, the difference between a salary of \$20,000 and \$25,000 is much larger than the difference between \$200,000 and \$205,000. So, if a log transform makes sense, do that.
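
The salary example can be checked with a few lines of arithmetic. This is only a sketch of the additive versus multiplicative view; the helper name `log_diff` is made up for illustration.

```python
import math

def log_diff(low, high):
    """Difference on the log scale, equal to log(high / low)."""
    return math.log(high) - math.log(low)

# Same absolute gap (5,000) at two different salary levels.
for low, high in [(20_000, 25_000), (200_000, 205_000)]:
    print(
        f"{low:>7} -> {high:>7}: "
        f"difference = {high - low}, ratio = {high / low:.3f}, "
        f"log difference = {log_diff(low, high):.4f}"
    )
```

Both gaps are identical on the additive scale, but on the log scale the first corresponds to a 25% raise and the second to a 2.5% raise. For timing data, the analogous question is whether a fixed number of seconds matters equally for short and long responses, or whether proportional changes are the more natural unit.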

If log transform doesn't make sense, then I'd try a method that doesn't assume normal residuals - e.g. quantile regression.
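
For intuition about why quantile regression does not rely on normal residuals, here is a minimal sketch of the idea in the simplest (intercept-only) case. The fitted value is whatever minimizes the "check" (pinball) loss, which for q = 0.5 is the sample median. The data and the helper names `pinball_loss` and `fit_quantile` are made up for illustration; a real analysis would use a regression package rather than this brute-force search.

```python
import math
from statistics import median

def pinball_loss(values, candidate, q):
    """Average check loss of predicting `candidate` for every value at quantile q."""
    total = 0.0
    for v in values:
        err = v - candidate
        # Under-predictions are weighted by q, over-predictions by (1 - q).
        total += q * err if err >= 0 else (q - 1) * err
    return total / len(values)

def fit_quantile(values, q):
    """Intercept-only quantile fit: pick the observed value with the smallest loss."""
    return min(values, key=lambda c: pinball_loss(values, c, q))

# Hypothetical skewed timing data with one very long response (odd sample size).
times = [1.1, 1.2, 1.5, 1.8, 2.0, 2.3, 9.0]

fitted_median = fit_quantile(times, 0.5)

# Quantiles commute with monotone transforms: fit on the log scale, then back-transform.
fitted_from_logs = math.exp(fit_quantile([math.log(t) for t in times], 0.5))

print(fitted_median, median(times), fitted_from_logs)
```

In this simple case the extreme value barely influences the fit, and the fitted median is the same (up to floating-point rounding) whether or not the data are logged first. With continuous predictors the coefficients of a linear quantile regression do depend on the scale chosen, so this invariance should be read as intuition rather than a general guarantee.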
