Why could transforming a distribution help model accuracy?

Question

Sometimes the distribution of a variable is not normal but left- or right-skewed, and people tend to transform it, for example with x^2 or log(x), to make it look more normal.

The question is: why would this help model accuracy?

Thanks,

As tagging implies, there are many threads here on transformations. This may not be an exact duplicate of any existing thread, but at the same time I don't think there is anything new here. https://stats.stackexchange.com/questions/107610/what-is-the-reason-the-log-transformation-is-used-with-right-skewed-distribut is good on the most common single case. — Nick Cox, Jun 03 '17 at 14:45
This question is a little out of focus, because transformations are not explicitly to "help model accuracy," nor are they generally to create Normal distributions. Their purposes include [linearizing relationships](https://stats.stackexchange.com/a/35717/919), [achieving symmetric or homoscedastic residuals](https://stats.stackexchange.com/a/4833/919), [improving goodness of fit](https://stats.stackexchange.com/a/41377/919), [expressing relationships additively](https://stats.stackexchange.com/a/86265/919), and [much more](https://stats.stackexchange.com/a/60455/919). — whuber, Jun 03 '17 at 15:53
Thanks @NickCox for the link. I was puzzled about why people apply this transformation and why it sometimes improves model accuracy. Your comment and whuber's helped me understand that a transformation can make a relationship linear, symmetric, or homoscedastic, which sometimes improves the accuracy of linear models because they are based on these assumptions. However, if we don't use linear models, these don't matter as much. — frank, Dec 01 '17 at 15:07
And it only works sometimes because, besides linearity, accuracy depends on many other things. (For example, a model can be perfectly linear and have a good training score but still give poor predictions if it ignores important confounding factors.) I wish to keep this post for its broader scope. — frank, Dec 01 '17 at 15:07
The other question answers why a log transformation helps with a skewed distribution, but this question goes further: after the transformation, why is a symmetric distribution sometimes helpful to linear models? — frank, Dec 01 '17 at 15:16

score 0 · Answer 1 · answered Jun 03 '17 at 14:32

The reason is that you are assuming a normal distribution. To explain a bit: in your model you are assuming that some change in X yields a change in Y, such that increasing X by some value always changes Y by the same amount. But that's not the way real life works in many cases. The idea of "too much of a good thing" is common in real data.

Take this example: the relationship between the height of a corn stalk and the amount of water it is given. This relationship might be linear: I give the corn plant 100 ml of water and this increases its height by 1 cm. In this case my beta coefficient would be 0.01 (i.e., 1 milliliter yields a 0.01 centimeter increase in height). But that is not how things in life work.

The relationship between the height of the corn and the amount of water it gets is not linear. In other words, I can't keep giving a corn plant water to make it grow infinitely tall. At a certain point, no matter how much water I give it, it will not grow any taller (suggesting a logarithmic relationship). Further, there can be a point where I give the corn plant too much water, causing it to wilt and die.

If I were to model the relationship between watering a plant and plant height as linear, my model would be misleading. I transform the variable to depict the relationship between my predictor and dependent variable more accurately.
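
To make the corn example concrete, here is a minimal sketch on made-up data. The log-shaped growth curve, the noise level, and the names `water_ml`, `height_cm`, and `r_squared` are all assumptions for illustration, not real measurements. It fits a straight line once to the raw predictor and once to its logarithm, then compares the fit.

```python
import numpy as np

# Hypothetical data: height levels off as water increases.
# The log-shaped curve and the noise level are assumptions for illustration.
rng = np.random.default_rng(0)
water_ml = np.linspace(50, 2000, 40)
height_cm = 30.0 * np.log(water_ml) + rng.normal(0.0, 2.0, size=water_ml.size)

def r_squared(x, y):
    """R^2 of an ordinary least-squares straight line fitted to (x, y)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

r2_raw = r_squared(water_ml, height_cm)          # line in water
r2_log = r_squared(np.log(water_ml), height_cm)  # line in log(water)

print(f"R^2 with raw water:  {r2_raw:.3f}")
print(f"R^2 with log(water): {r2_log:.3f}")
```

On data generated this way, the line in log(water) fits better simply because the data were built to follow that shape. With real data, the shape has to be checked rather than assumed.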

One final caution: your transformation should make sense in the context of the theory of your field. Transforming a variable just so that your model is more elegant is not appropriate. The apparent relationship might be an artifact of your sample rather than the true relationship within the population.

This answer asserts that we are assuming a normal distribution (of what, precisely) but then morphs into an example where a nonlinear relationship is better off treated by logging the predictor. At least that's what you seem to be suggesting. The question about a normal distribution is never really addressed. I don't think this is answering the question directly. A much shorter answer is that often normality is not a key assumption (better explained as "ideal condition") at all, but rather transformations may help with linearity, additivity and homoscedasticity. — Nick Cox, Jun 03 '17 at 14:41
@NickCox +1 to your comment. But can you please elaborate and expand on how "transformation may help with linearity...". OP's question is specifically about model accuracy, which I interpret to mean better fit of the response. — horaceT, Jun 03 '17 at 15:17
@horaceT That's what the thread I linked to in my comment on the question discusses. Simple examples: if the real pattern for outcome $y$ and predictor $x$ is more like $\exp(a + bx)$ or $a x^b$, consider working on the logarithmic scale (in the second example with $\log x$ as well). (I am interpreting accuracy loosely as meaning finding a better or more appropriate model.) — Nick Cox, Jun 03 '17 at 15:26
@NickCox Agree with your comment. The answer wasn't to the point. — frank, Dec 01 '17 at 15:11
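
The power-law case mentioned in the comments, $y = a x^b$, can be sketched in a few lines. Taking logs of both sides gives $\log y = \log a + b \log x$, which is a straight line in $\log x$. The values of `a_true` and `b_true` and the noise-free data below are assumptions chosen purely for illustration.

```python
import numpy as np

# Hypothetical, noise-free power-law data: y = a * x**b
a_true, b_true = 2.0, 1.5
x = np.linspace(1.0, 50.0, 30)
y = a_true * x ** b_true

# On the log-log scale the relationship is linear:
# log(y) = log(a) + b * log(x)
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
b_hat = slope
a_hat = np.exp(intercept)

print(f"estimated a = {a_hat:.4f}, estimated b = {b_hat:.4f}")
```

With noisy data, note that least squares on the log scale treats errors as multiplicative on the original scale, which may or may not suit the problem.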
