
My data contains some skewed features, and the response variable (sale price) is skewed as well.

Log-transforming all relevant features and the response variable seems to be good enough and 'fixes' the skew.

Questions are:

  1. Should I indeed also log-transform my response variable?

  2. After building my model on the log-transformed training set (e.g. a linear regression model), when I want to use it to predict the sale price for a test set, should I log-transform all relevant features there as well? I think not, but I'm not sure.

kjetil b halvorsen
Adiel
    You cannot really decide on transformations only from such information (distributional) as given here! Transformations change the *meaning* of a model, so we really need to know to what use the model will be put, what kind of interpretation you are after ... – kjetil b halvorsen Feb 26 '17 at 17:29
  • Please see [this thread](http://stats.stackexchange.com/q/298/28500) and many others on this site about log and other transformations, of both features and responses. Skewed features and responses don't matter per se; linearity of the regression relation and the distributions of the _residuals_ about the regression are what matter. – EdM Feb 26 '17 at 17:35
  • @Adiel Have you found the correct answer yet? – spectre Jan 06 '22 at 10:11

1 Answer


For 1): if the response is also skewed, you had better log-transform the response variable as well.

For 2): once you log-transform, the coefficients of your multiple linear regression (if there is more than one predictor) are interpreted differently than non-transformed coefficients. For example: $\log Y = a_1 \log(X_1) + a_2 \log(X_2)+\cdots$.

The interpretation would be: "for every one-unit increase in $\log(X_1)$, $\log(Y)$ increases by $a_1$, after adjusting for the other predictors." So yes, transform the test-set features too: you first predict $\log(Y)$ from $\log(X_1)$ (and the other log-transformed predictors), since your model has the form above. Then take the exponential of that value to estimate the predicted $Y$.
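A minimal sketch of that workflow (not from the original answer; the data here is synthetic and the variable names are made up for illustration): log-transform the features and the target on the training set, fit an ordinary linear regression, apply the *same* log transform to the test features, and exponentiate the prediction to get back to the original price scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic skewed data: price grows multiplicatively with living area,
# so both the feature and the response are right-skewed.
area = rng.lognormal(mean=7.0, sigma=0.5, size=200)        # skewed feature
price = 50.0 * area ** 1.2 * rng.lognormal(0, 0.1, 200)    # skewed response

# Same log transform for train and test features.
X_train = np.log(area[:150]).reshape(-1, 1)
X_test = np.log(area[150:]).reshape(-1, 1)
y_train = np.log(price[:150])          # log-transformed target (train only)

model = LinearRegression().fit(X_train, y_train)

log_pred = model.predict(X_test)       # predictions on the log scale
pred_price = np.exp(log_pred)          # back-transform to the price scale
```

One caveat worth knowing: under log-normal errors, $\exp$ of the predicted log corresponds to the conditional *median*, not the conditional mean, so naive back-transformation can systematically under-predict average prices; corrections such as Duan's smearing estimator exist for that.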

Michael Hardy
NiroshaR
  • Thanks, but something still isn't clear to me: if I transform Y and all needed predictors, does that mean I need to transform them in the test set as well (except Y, which isn't available in the test set)? If yes, does that mean that once I use my model to make a prediction on the test set, I should take the exp of the predicted value as the real prediction? If no, could you please explain further what I'm missing? Thanks – Adiel Feb 26 '17 at 18:15
  • What matters is skew in the _residuals_ around the model, not the skew of the response variable itself. Depending on the data, the response variable itself might not need transformation. – EdM Feb 26 '17 at 18:30
  • @EdM So what you are saying is that if my target variable or the independent variable is skewed, then it is OK not to transform them toward a Normal distribution? Am I understanding you correctly? – spectre Jan 06 '22 at 10:10
  • @spectre yes, you're correct. I can't think of a situation when there is a need to transform variables into Normal distributions before regression. You don't even need Normally distributed _residuals_ for ordinary least squares to give the [best linear unbiased estimate](https://en.wikipedia.org/wiki/Gauss–Markov_theorem) of regression coefficients. Do a search on this site for words or phrases like "normality testing" for further discussion. Might be different for things like neural nets, but that's outside my expertise. – EdM Jan 06 '22 at 16:51