
I'm working with a dataset where the dependent variable is continuous (the sale price of houses), and there are a couple dozen features I'm using to predict the sale price with a linear regression model. These features include binary dummy variables, categorical variables, and continuous variables, all on different scales.

The dependent variable (sale price) is skewed, so I've instead created a new feature that is log(salePrice), which makes the distribution roughly symmetric. I had planned on using scikit-learn's StandardScaler class on the explanatory features. Does it make sense to use two different preprocessing techniques, or should I simply take the log of all the explanatory features as I do with the dependent variable?
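For concreteness, here is a minimal sketch of the setup I have in mind (toy data; the columns and values are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Toy stand-in for the real data (hypothetical columns):
# continuous features on very different scales plus a 0/1 dummy.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(1500, 400, 200),   # square footage
    rng.normal(0.3, 0.1, 200),    # lot size in acres
    rng.integers(0, 2, 200),      # dummy: has garage
])
sale_price = np.exp(12 + 0.0005 * X[:, 0] + rng.normal(0, 0.2, 200))

y_log = np.log(sale_price)                    # log the skewed target
X_scaled = StandardScaler().fit_transform(X)  # standardize all features

model = LinearRegression().fit(X_scaled, y_log)
```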

asked by idclark (edited by kjetil b halvorsen)

1 Answer


There is no reason to require that the predictor variables be transformed in the same way as the $Y$-variable. Depending on the nature of the variables, such a requirement may not even make sense. In your case, for example, some of the explanatory variables are dummies, and transforming a 0/1 dummy accomplishes nothing useful. Scale differences between the $Y$-variable and the predictors are taken care of by the estimation algorithm: the regression coefficients absorb them.
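As a concrete illustration of mixing transformations, here is a minimal scikit-learn sketch (toy data, made-up column names): it standardizes only the continuous predictors, passes the dummy through untouched, and log-transforms only the target.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data; column names are hypothetical stand-ins for the real features.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "sqft": rng.normal(1500, 400, 300),       # continuous, large scale
    "lot_acres": rng.normal(0.3, 0.1, 300),   # continuous, small scale
    "has_garage": rng.integers(0, 2, 300),    # 0/1 dummy: leave as-is
})
y = np.exp(11 + 0.0006 * X["sqft"] + rng.normal(0, 0.25, 300))

# Standardize only the continuous predictors; the dummy passes through.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["sqft", "lot_acres"])],
    remainder="passthrough",
)

# Fit OLS on log(y); predictions are mapped back to the price scale.
model = TransformedTargetRegressor(
    regressor=Pipeline([("prep", preprocess), ("ols", LinearRegression())]),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X, y)
print(model.predict(X.head()))  # predictions in dollars, not log-dollars
```

A convenient side effect of TransformedTargetRegressor is that it inverts the log at prediction time, so you never have to exponentiate predictions by hand.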

For more information on reasons to transform, or not to transform, see these excellent answers:
Why not log-transform all variables that are not of main interest?
Pitfalls to avoid when transforming data?

An answer to an almost identical question: Analysing log and square-root transformed variables

answered by kjetil b halvorsen