
I was wondering whether it is necessary to mean center and scale to unit standard deviation both my xs and ys in linear regression, or whether doing that to just the xs is enough.

Let's say I use a different model, say a neural network; is it then necessary to standardize? In general, I want to know whether it is good to standardize both xs and ys, whatever model I use.

user34790
  • Closely related (but w/o the emphasis on ANNs): [When should you center your data & when should you standardize?](http://stats.stackexchange.com/questions/29781/) – gung - Reinstate Monica Aug 27 '13 at 14:42

2 Answers


In linear regression it is not necessary to mean center or normalize either your x or y variable.

Some people think it improves interpretation; I tend to disagree and prefer the raw units. But both are statistically fine.

The main reason I don't like scaling is that it makes the parameter estimates refer to standard deviations derived from the data set you are using. This seems to me to be less intuitive than the raw units. Some people like scaling because then all the x variables are on the "same" scale (standard deviation).
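
To make that concrete, here is a minimal sketch in Python (the data are simulated purely for illustration): fitting on raw units and on standardized units gives identical predictions, and only the units of the coefficients change.

```python
# Sketch: OLS on raw units vs. on standardized x and y.
# The fit is just reparameterized; predictions agree, coefficient units differ.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))   # predictors in "raw" units
y = 3.0 + 0.5 * X[:, 0] - 1.2 * X[:, 1] + rng.normal(scale=2.0, size=200)

def ols(X, y):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

beta_raw = ols(X, y)                                   # slopes per raw unit of x

# Standardize both X and y (mean 0, sd 1) and refit.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()
beta_std = ols(Xz, yz)                                 # slopes in sd's of y per sd of x

# Fitted values are identical once mapped back to the original units.
pred_raw = np.column_stack([np.ones(len(X)), X]) @ beta_raw
pred_std = (np.column_stack([np.ones(len(Xz)), Xz]) @ beta_std) * y.std() + y.mean()
print(np.allclose(pred_raw, pred_std))                 # True
print("raw-unit slopes:      ", beta_raw[1:])
print("standardized slopes:  ", beta_std[1:])
```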

Peter Flom
    +1 You can just center, without normalizing, and you can center on something other than the mean. This can give the majority of the benefit for interpretation, without the issues of scaling that you point out. – Wayne Aug 27 '13 at 14:07

> I was wondering whether it is necessary to mean center and scale to unit standard deviation both my xs and ys in linear regression, or whether doing that to just the xs is enough.

As @Peter already stated, there is no need for such normalization in the case of linear regression.

> Let's say I use a different model, say a neural network; is it then necessary to standardize? In general, I want to know whether it is good to standardize both xs and ys, whatever model I use.

This part of the question is much harder, as unfortunately there is no general rule here. Each model has its own "rules" and requirements.

  • Neural networks - in general, normalization is not required, as their universal approximation abilities should overcome any scale-related problems. In practice, neural networks have many more parameters than the ones you see when using a library. You set the topology and the activation functions ($\tanh$ or sigmoid in most cases), but there are at least two more things that are heavily scale dependent - the initial values of the connection weights and the slope of the activation function. Good parameters can be found for any kind of data, but most of the common "rules of thumb" implemented in the libraries available in most programming languages (including MATLAB) are designed for standardized inputs (scaled to the $[0,1]$ or $[-1,1]$ interval). It also matters because neural networks tend to get stuck in local minima, so some authors additionally suggest decorrelating and standardizing the data.
  • Support vector machines - as this is a geometrical model, data scaling is very important here. Any "artificial" disproportion between the amplitudes of particular dimensions leads to a badly distorted model (the separating hyperplane will be biased towards the dimensions with larger values). As a result it is strongly suggested to standardize the data before applying an SVM, but the exact procedure is an open problem (the most popular choices are linear "squashing", normalizing to mean 0 and variance 1, and decorrelation through the square root of the inverse of the data covariance, i.e. whitening).
  • Decision trees - no preprocessing is needed, as the rules found by this classifier are scale independent.

Even these few examples show the wide variety of preprocessing requirements. If you do not know the model well, the safest approach seems to be standardizing your data, since the bias this induces should be much smaller than the model-related bias you get from raw values; the sketch below illustrates this for an SVM.
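
As a rough sketch of that advice (assuming scikit-learn, with made-up data whose two features live on wildly different scales), you would typically put the scaler and the SVM into one pipeline, so that the scaling learned on the training data is reused at prediction time:

```python
# Sketch: RBF SVM with and without standardization.
# Without scaling, the kernel is dominated by the large-scale feature;
# with StandardScaler, both features contribute to the decision boundary.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
# Two equally informative features on wildly different scales.
X = np.column_stack([rng.normal(0, 1, n) * 1000.0, rng.normal(0, 1, n) * 0.001])
y = (X[:, 0] / 1000.0 + X[:, 1] / 0.001 > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw_svm = SVC(kernel="rbf").fit(X_train, y_train)
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)

print("no scaling:  ", raw_svm.score(X_test, y_test))
print("standardized:", scaled_svm.score(X_test, y_test))
```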

lejlot