
I would like to run linear regression over a multi-dimensional data set. The dimensions differ by orders of magnitude: for instance, dimension 1 generally has a value range of [0, 1], while dimension 2 has a value range of [0, 1000].

Do I need to apply any transformation to ensure the data ranges for the different dimensions are on the same scale? If so, is there any guidance for this kind of transformation?

– bit-question

2 Answers


Shifting/scaling variables will not affect their correlation with the response

To see why this is true, suppose that the correlation between $Y$ and $X$ is $\rho$. Then the correlation between $Y$ and $(X-a)/b$ is

$$ \frac{ {\rm cov}(Y,(X-a)/b) }{ {\rm SD}((X-a)/b) \cdot {\rm SD}(Y) } = \frac{ {\rm cov}(Y,X/b) }{ {\rm SD}(X/b) \cdot {\rm SD}(Y) } = \frac{ \frac{1}{b} \cdot {\rm cov}(Y,X) }{ \frac{1}{b}{\rm SD}(X) \cdot {\rm SD}(Y) } = \rho$$

which follows from the definition of correlation and three facts:

  • ${\rm cov}(Y, X+a) = {\rm cov}(Y,X) + \underbrace{{\rm cov}(Y,a)}_{=0} = {\rm cov}(Y,X)$

  • ${\rm cov}(Y,aX) = a {\rm cov}(Y,X)$

  • ${\rm SD}(aX) = |a| \cdot {\rm SD}(X)$, which equals $a \cdot {\rm SD}(X)$ for the positive scale factors considered here

Therefore, in terms of model fit (e.g. $R^2$ or the fitted values), shifting or scaling your variables (e.g. putting them on the same scale) will not change the model, since linear regression coefficients are related to the correlations between variables. It will only change the scale of your regression coefficients, which should be kept in mind when you're interpreting the output if you choose to transform your predictors.
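
As a quick numerical check, here is a minimal R sketch on simulated data (the variables and coefficients are invented for illustration):

    set.seed(1)                            # reproducible fake data
    x1 <- runif(100, 0, 1)                 # predictor on a [0, 1] scale
    x2 <- runif(100, 0, 1000)              # predictor on a [0, 1000] scale
    y  <- 2*x1 + 0.01*x2 + rnorm(100)

    # Correlation is invariant to shifting and (positive) scaling
    cor(y, x2)
    cor(y, (x2 - 500)/250)                 # identical to the line above

    # Model fit is unchanged: same R^2 and the same fitted values
    fit_raw <- lm(y ~ x1 + x2)
    fit_std <- lm(y ~ scale(x1) + scale(x2))
    summary(fit_raw)$r.squared             # equal to ...
    summary(fit_std)$r.squared             # ... this
    all.equal(fitted(fit_raw), fitted(fit_std))   # TRUE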

Edit: The above assumes that you're talking about ordinary regression with an intercept. A couple more points related to this (thanks @cardinal):

  • The intercept can change when you transform your variables. Also, as @cardinal points out in the comments, the coefficients will change when you shift your variables if you omit the intercept from the model, although I assume you're not doing that unless you have a good reason (see e.g. this answer).

  • If you're regularizing your coefficients in some way (e.g. Lasso, ridge regression), then centering/scaling will impact the fit. For example, if you're penalizing $\sum \beta_{i}^{2}$ (the ridge regression penalty), then you cannot recover an equivalent fit after standardizing unless all of the variables were on the same scale in the first place; i.e., there is no constant multiple of the coefficients that will recover the same penalty (see the sketch below).
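
To see the regularized case concretely, here is a hand-rolled ridge sketch in base R (the data and $\lambda$ are invented; a real analysis would typically use a package such as glmnet):

    # Ridge estimate: beta = (X'X + lambda*I)^{-1} X'y, fit without an
    # intercept, so X and y are centered first
    ridge <- function(X, y, lambda) {
      solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
    }

    set.seed(1)
    X <- cbind(runif(50, 0, 1), runif(50, 0, 1000))  # wildly different scales
    y <- X %*% c(2, 0.01) + rnorm(50)
    X <- scale(X, scale = FALSE)          # center only, keep original scales
    y <- y - mean(y)                      # center the response

    fit_raw <- X %*% ridge(X, y, lambda = 1)
    fit_std <- scale(X) %*% ridge(scale(X), y, lambda = 1)

    # Unlike OLS, the fitted values disagree: the penalty is not scale-invariant
    all.equal(c(fit_raw), c(fit_std))     # not TRUE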

Regarding when/why a researcher may want to transform predictors

A common circumstance (discussed in the subsequent answer by @Paul) is that researchers will standardize their predictors so that all of the coefficients will be on the same scale. In that case, the size of the point estimates can give a rough idea of which predictors have the largest effect once the numerical magnitude of the predictor has been standardized.

Another reason a researcher may like to scale very large variables is so that the regression coefficients do not end up on an extremely tiny scale. For example, if you wanted to look at the influence of a country's population size on crime rate (couldn't think of a better example), you might want to measure population size in millions rather than in its original units, since the coefficient may be something like $.00000001$.
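
A small R illustration of both points (the numbers are invented, e.g. a slope of $10^{-8}$ per person):

    set.seed(1)
    pop  <- runif(100, 1e6, 3e8)            # population in raw units (people)
    rate <- 5 + 1e-8 * pop + rnorm(100)     # hypothetical crime rate

    coef(lm(rate ~ pop))                    # slope ~ 0.00000001, hard to read
    coef(lm(rate ~ I(pop/1e6)))             # same fit; slope ~ 0.01 per million

    # Standardizing instead puts every predictor's coefficient on a common scale
    coef(lm(rate ~ scale(pop)))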

– Macro
  • Two quick remarks: While the beginning of the post is correct, it misses the fact that centering *will* have an effect if an intercept is absent. :) Second, centering and rescaling has *important* effects if regularization is used. While the OP may not be considering this, it is still probably a useful point to keep in mind. – cardinal Jul 20 '12 at 01:05
  • The invariance to *rescaling* is also easily seen if one is comfortable with matrix notation. With $X$ full rank (for simplicity), $\hat y = X (X'X)^{-1} X'y$. Now if we replace $X$ by $X D$ where $D$ is diagonal we get $$\tilde y = (X D) ((XD)'XD)^{-1} (XD)'y = X D(D X'X D)^{-1} D X'y = X (X'X)^{-1} X'y = \hat y\>.$$ – cardinal Jul 20 '12 at 01:05
  • @cardinal, I've decided to mention the fact that, if your estimates are regularized then centering/scaling can have an impact. I resisted at first because I thought it would begin a long digression that may confuse those who are not familiar with regularizing but I found I could address it with relatively little space. Thanks-- – Macro Jul 20 '12 at 11:29
  • Not all my comments are necessarily meant to suggest that the answer should be updated. Many times I just like to slip in ancillary remarks under nice answers to give a couple thoughts on related ideas that might be of interest to a passer-by. (+1) – cardinal Jul 20 '12 at 13:26

The so-called "normalization" is a common routine for most regression methods. There are two common approaches (a small R sketch follows the list):

  1. Map each variable into the [-1, 1] range (`mapminmax` in MATLAB).
  2. Remove the mean from each variable and divide by its standard deviation (`mapstd` in MATLAB), i.e. actually "normalize". If the true mean and deviation are unknown, just use the sample statistics: $$\tilde{X}_{ij}=\frac{X_{ij}-\mu_i}{\sigma_i}$$ or $$\tilde{X}_{ij}=\frac{X_{ij} - \overline{X_i}}{{\rm std}({X_i})}$$ where $\mu_i = E[X_i]$, $\sigma_i^2 = E[X_i^2]-E[X_i]^2$, $\overline{X_i}=\frac{1}{N}\sum_{j=1}^{N}X_{ij}$ and ${\rm std}({X_i}) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}(X_{ij} -\overline{X_{i}})^2}$
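
A base-R sketch of both transformations (the function names here are made up; `mapminmax` and `mapstd` are the MATLAB built-ins being mirrored):

    # 1. Min-max mapping into [-1, 1], like MATLAB's mapminmax
    map_minmax <- function(x) 2 * (x - min(x)) / (max(x) - min(x)) - 1

    # 2. Standardization (z-scores), like MATLAB's mapstd
    # (note: R's sd() uses the N-1 denominator rather than 1/N)
    map_std <- function(x) (x - mean(x)) / sd(x)

    x <- runif(20, 0, 1000)
    range(map_minmax(x))                    # -1 1
    c(mean(map_std(x)), sd(map_std(x)))     # ~0 and 1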

As linear regression is very sensitive to the variables' ranges, I would generally suggest normalizing all the variables if you do not have any prior knowledge about the dependence and expect all the variables to be relatively important.

The same goes for the response variable, although it is much less important there.

Why do normalization or standardization? Mostly to determine the relative impact of the different variables in the model, which can be assessed when all the variables are in the same units.

Hope this helps!

– Paul
  • What do you mean when you say *linear regression is very sensitive to the variables ranges*? For any `x1,x2,y` these two commands: `summary(lm(y~x1+x2))$r.sq` and `summary(lm(y~scale(x1)+scale(x2)))$r.sq` - the $R^2$ values when you don't standardize the coefficients and when you do - give the same value, indicating equivalent fit. – Macro Jul 19 '12 at 20:33
  • I was not completely correct in the formulation; I meant the following. The regression will always be the same (in the sense of $\mathbf{R^2}$) if you perform only linear transformations of the data. But if you want to determine which variables are crucial and which are almost noise, the scale matters. It is just convenient to standardize the variables and forget about their original scales. So regression is "sensitive" in terms of understanding relative impacts. – Paul Jul 19 '12 at 20:40
  • Thanks for clarifying, but *which variables are crucial and which are almost noise* is often decided by the $p$-value, which also won't change when you standardize (except for the intercept, of course). I agree with your point that it does provide a nicer interpretation of the raw coefficient estimates. – Macro Jul 19 '12 at 20:44