
So for a project I am using SPSS to create a mean test score from three different tests. Since the tests use different scales, before creating the mean I save each test score as a standardised residual, after controlling for confounders such as the 'time taken' to do the test.

Now my confusion is this: for two of the tests, 'time taken' is not normally distributed, while for the third it is. When I take the log of the two non-normal 'time taken' variables, they fit a normal distribution.

So is it correct to control for 'time taken' as a log variable for two of the tests, and to control for it untransformed for the third? Or do the three need to be consistent, because I am creating a mean score of the three for the rest of my analysis?
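
To make the setup concrete, here is a rough sketch in Python of what I am doing (my real analysis is in SPSS; the data and column names below are all made up):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Fake data standing in for my real file (column names are invented)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "score1": rng.normal(50, 10, n),  "time1": rng.lognormal(3, 0.5, n),
    "score2": rng.normal(0, 1, n),    "time2": rng.lognormal(4, 0.7, n),
    "score3": rng.normal(100, 15, n), "time3": rng.normal(60, 10, n),
})

def standardised_residuals(y, X):
    """Regress y on the confounder(s) X and return standardised residuals."""
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    return fit.resid / fit.resid.std()

df["log_time1"] = np.log(df["time1"])   # non-normal 'time taken' -> logged
df["log_time2"] = np.log(df["time2"])   # non-normal 'time taken' -> logged
# time3 stays on its original scale (already roughly normal)

z1 = standardised_residuals(df["score1"], df[["log_time1"]])
z2 = standardised_residuals(df["score2"], df[["log_time2"]])
z3 = standardised_residuals(df["score3"], df[["time3"]])

df["mean_z"] = (z1 + z2 + z3) / 3       # the composite I use as the DV later
```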

Sam123
  • What statistical test(s) are you doing? – Ruben van Bergen Mar 24 '17 at 13:53
  • Once I create the mean of the three test scores, I'm using the mean as a dependent in a multiple linear regression and F-test. – Sam123 Mar 24 '17 at 14:04
  • Interesting twist on a common question. But please clarify whether your creation of the mean occurs before or after the regression. If after: when you save the standardized residuals, you are saving information not about the scores themselves but about their mean's adjusted relationship with some other variable in the regression. So you no longer are working with an indicator on the original domain. – rolando2 Mar 24 '17 at 15:37
  • @rolando2 To clarify, I will first do three linear regressions with the test score as the 'dependent', and here I will control for the 'time taken' and other possible confounders, and save each of these as a standardised residual score. Then I will create a mean of these three standardised residuals, and this mean score will be used as the 'dependent' in a multiple linear regression model and some F-tests. I hope I am being clear. – Sam123 Mar 24 '17 at 17:59
  • My earlier comment may well have been wrong... the method you describe is idiosyncratic, but then again your situation is pretty unique. I just hope you'll be able to defend your approach to any critics that matter. – rolando2 Mar 24 '17 at 19:03
  • @rolando2 So does that mean that either approach could work if I justified it? How would one justify converting the one normally distributed variable to a logarithm? It doesn't make sense to me, because it was already normal, so I'm not sure how I would explain that it was transformed just for consistency. – Sam123 Mar 24 '17 at 21:17

2 Answers


In plain language, what a log transformation does is squish the right tail of a distribution. For example, imagine that you have some data about people's income. Income tends to be positively skewed: most incomes cluster around or below the median wage, but then you have a few millionaires who earn 10 or 100 times the median.

If you try to model the income variable untransformed with an OLS linear model (e.g. linear regression, ANOVA), the model probably won't give you a great answer. The reason is that the model does its best to minimize the sum of squared distances from the trend. Since the millionaires may be 10 or 100 times away from the average person in terms of income, they will also be very far from the "trend" as you or I would think of it, and the model will therefore give them much more weight than the other people. Essentially, the model thinks that being a couple of \$100,000s wrong for a few millionaires is worse than being a couple of \$10,000s wrong for the average person. That may not be the case, however: for the average person, a difference of \$10,000 in yearly earnings may be huge, whereas for a millionaire, \$100,000 here or there might not make much of a difference.
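
A quick, made-up illustration of that weighting effect (just squared deviations from the mean, which is what OLS minimizes in the intercept-only case):

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.5, sigma=0.8, size=1000)  # skewed "incomes"
income[-5:] *= 50                                        # a handful of millionaires

# Squared errors around the fitted mean: the extremes dominate the total
sq_err = (income - income.mean()) ** 2
top5_share = np.sort(sq_err)[-5:].sum() / sq_err.sum()
print(f"5 of 1000 people account for {top5_share:.0%} of the squared error")
```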

When you log transform, you squish multiplicative differences into additive differences. So, e.g., when you take a log with base 10, the difference between \$10,000 and \$100,000 (5 − 4 = 1) is the same as the difference between \$100,000 and \$1,000,000 (6 − 5 = 1). I.e. the difference between a very poor person and someone in the upper-middle/lower-upper class becomes the same as the difference between the upper-middle/lower-upper class and a super wealthy person (without the transformation, the model would think the upper-middle/lower-upper and poor person are MUCH, MUCH more similar than the upper-middle/lower-upper and the super wealthy).
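
The same arithmetic in code:

```python
import numpy as np

incomes = np.array([10_000, 100_000, 1_000_000])
print(np.diff(incomes))            # [ 90000 900000] -- raw gaps differ 10-fold
print(np.diff(np.log10(incomes)))  # [1. 1.]         -- equal gaps after log10
```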

If you model with log-transformed variables, keep in mind that the transformed predictors are no longer on the original scale, so a difference of 1 on a log-transformed scale corresponds to a multiplicative difference on the original scale. Whether a log transformation is a good idea is a tricky question that depends on the context: ask yourself whether it makes sense and whether you can still interpret the predictor afterwards. It can help with extremely skewed data, though, e.g. reaction times.
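
As a minimal sketch of that multiplicative interpretation (simulated data; the true effect of 2 per unit of ln(x) is invented for the example):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.lognormal(0, 1, 500)                 # skewed predictor, e.g. reaction time
y = 2.0 * np.log(x) + rng.normal(0, 1, 500)  # outcome driven by log(x)

# Coefficient on the natural-log predictor: effect per multiplicative change
b = sm.OLS(y, sm.add_constant(np.log(x))).fit().params[1]
print(f"doubling x shifts y by about {b * np.log(2):.2f} units")  # ~2*ln(2) = 1.39
```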

Adam B.

First, remember that in your regression it is the error term that should be normal, not the dependent variable.
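
Here is a small simulated illustration of that point (made-up data, assuming statsmodels and scipy): the dependent variable can be clearly non-normal while the residuals, which are what actually matters, look fine:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(1.0, 500)      # skewed predictor
y = 3 * x + rng.normal(0, 1, 500)  # y inherits the skew, but the errors are normal

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
print(stats.shapiro(y).pvalue)     # typically tiny: y itself is far from normal
print(stats.shapiro(resid).pvalue) # typically large: the residuals look normal
```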

Second, you would not want to take the mean of two logged variables and one unlogged one, as that would give them very unequal weights.

JKP
  • Your last argument would seem to imply one should never mix logarithms and non-logarithms as explanatory variables in a regression, but clearly that's unnecessarily restrictive. You seem to be making some implicit assumptions about the actual test scores here, but what are they? – whuber Mar 24 '17 at 20:58
  • @whuber do you think in the scenario of my project it would then be okay to mix logarithms and non-logarithms? Or are there only specific cases where this can be done? The reason I am concerned is because if I was to convert the one normally distributed explanatory variable to a logarithm just for consistency, then I'm not sure how I could justify that? – Sam123 Mar 24 '17 at 21:15
  • @whuber My concern is that this amounts to very different weights for the logged and unlogged scores in the averaging, which I should think would require some substantive justification beyond just transforming to normality (which may be a specious concern here anyway). – JKP Mar 25 '17 at 22:43