Sometimes the distribution of a variable is not normal but skewed to the left or right, and people tend to transform it with x^2, log(x), etc. to make it look more normal.
My question is: why would this help model accuracy?
Thanks,
The reason is that your model makes an assumption about the shape of the relationship. To explain a bit: in a linear model you are assuming that increasing X by some fixed amount changes Y by some fixed amount, no matter how large X already is. But that's not the way real life works in many cases. The idea of "too much of a good thing" is common in real data.
Take this example: the relationship between the height of a corn stalk and the amount of water it is given. This relationship might be linear: I give the corn plant 100 ml of water and this increases its height by 1 cm. In this case my beta coefficient would be 0.01 (i.e. 1 milliliter yields a 0.01 centimeter increase in height). But that is not how things in life work.
The relationship between the height of the corn and the amount of water it gets is not linear. In other words, I can't keep giving a corn plant water to make it grow infinitely tall. At a certain point, no matter how much water I give it, it will not grow any taller (which suggests a logarithmic relationship). Further, there can be a point where I give the corn plant too much water, causing it to wilt and die.
If I were to model the relationship between watering a plant and plant height as linear, my model would be misleading. I transform the variable so that the model more accurately depicts the relationship between my predictor and my dependent variable.
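To make the corn example concrete, here is a minimal sketch in Python with NumPy. Everything in it is made up for illustration: the data are simulated, the "true" relationship is assumed to be logarithmic (height = 30 * log(water) plus noise), and the names compare_fits and r_squared are hypothetical helpers, not part of any library. It fits a straight line once to the raw water amounts and once to log(water), then compares how much of the variation in height each fit explains.

```python
import numpy as np


def r_squared(y, y_hat):
    """Share of the variance in y explained by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def compare_fits(seed=0, n=200):
    """Fit height ~ water and height ~ log(water) on simulated data.

    Returns (R^2 of the raw fit, R^2 of the log-transformed fit).
    """
    rng = np.random.default_rng(seed)

    # Simulated data (an assumption, not real measurements):
    # height levels off as the amount of water increases.
    water_ml = rng.uniform(50, 2000, size=n)
    height_cm = 30.0 * np.log(water_ml) + rng.normal(0.0, 5.0, size=n)

    # Straight line through the raw predictor.
    slope_raw, intercept_raw = np.polyfit(water_ml, height_cm, 1)
    r2_raw = r_squared(height_cm, slope_raw * water_ml + intercept_raw)

    # Straight line through the log-transformed predictor.
    log_water = np.log(water_ml)
    slope_log, intercept_log = np.polyfit(log_water, height_cm, 1)
    r2_log = r_squared(height_cm, slope_log * log_water + intercept_log)

    return r2_raw, r2_log


if __name__ == "__main__":
    r2_raw, r2_log = compare_fits()
    print(f"R^2 with raw water amount: {r2_raw:.3f}")
    print(f"R^2 with log(water):       {r2_log:.3f}")
```

Because the simulated relationship really is logarithmic, the fit on log(water) should explain more of the variation than the fit on the raw amounts. With real data you don't know the true shape in advance, which is why the choice of transformation needs a justification beyond a better fit.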
One final caution: your transformation should make sense in the context of the theory of your field. Transforming a variable just so that your model is more elegant is not appropriate. The relationship you see after transforming might just be an artifact of your sample rather than the true relationship within the population.