
Many textbooks and articles (such as this one) advise standardizing variables before entering them into a regression model, i.e., (variable − mean) / standard deviation.

However, I just came across a counterexample. How do you rationalize this?

Let's say:
$$Y = 2X$$
$$X = \{-3, -2, -1, 0, 1, 2, 3 \}$$
$$Y = \{-6, -4, -2, 0, 2, 4, 6 \}$$
After standardizing,
$$X_{sd} = \{-1.3887301, -0.9258201, -0.46291, 0.0, 0.46291, 0.9258201, 1.3887301 \}$$
$$Y_{sd} = \{-1.3887301, -0.9258201, -0.46291, 0.0, 0.46291, 0.9258201, 1.3887301 \}$$
$$\Rightarrow Y_{sd} = X_{sd}$$
So, in this example, if we standardize the variables, the desired slope of 2 (i.e., the effect size) vanishes!
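For reference, these standardized values can be reproduced in R with scale(), which applies exactly (variable − mean) / standard deviation:

x <- c(-3, -2, -1, 0, 1, 2, 3)
y <- 2 * x
scale(x)   # (x - 0) / sd(x): -1.3887301, -0.9258201, ..., 1.3887301
scale(y)   # the same values, since y = 2x and dividing by sd(y) cancels the factor of 2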

user2521204
  • No slope coefficient will have units of y / units of this x if you standardize this x, y, or both. I never standardise. I do sometimes shift predictors: for example, if calendar year is a predictor, the intercept and slope are often absurd, so use instead, say, year $-$ 2000 (or some other convenient year within the data range). – Nick Cox Jan 09 '22 at 18:30
  • Many similar questions, some: https://stats.stackexchange.com/questions/48360/is-standardization-needed-before-fitting-logistic-regression, https://stats.stackexchange.com/questions/287370/standardization-vs-normalization-for-lasso-ridge-regression, https://stats.stackexchange.com/questions/223432/standardising-non-normally-distributed-predictors-for-regression, https://stats.stackexchange.com/questions/99761/do-you-ever-center-and-standardize-variables-in-multiple-regression, https://stats.stackexchange.com/questions/342140/standardization-of-continuous-variables-in-binary-logistic-regression – kjetil b halvorsen Jan 09 '22 at 19:35

2 Answers


The article you are referring to (correctly) advises centering covariates involved in higher-order polynomial terms or interactions. The reason is simple: it reduces collinearity across the related terms and thus improves the stability of the coefficient estimates and their interpretation.
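As a quick illustration of the collinearity point (this snippet is not from the original answer):

x  <- 1:20           # a predictor taking only positive values
cor(x, x^2)          # close to 1: the raw term and its square are nearly collinear
xc <- x - mean(x)    # centered predictor
cor(xc, xc^2)        # essentially 0: centering decorrelates the linear and quadratic terms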

Your example does not make much sense here because $Y$ is the response variable, not a covariate. Any transformation of the response changes the result, including standardization.
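To see the second point concretely (a small check, not part of the original answer), standardizing only the response simply rescales the slope by $1/\text{sd}(Y)$:

x  <- c(-3, -2, -1, 0, 1, 2, 3)
y  <- 2 * x
yz <- (y - mean(y)) / sd(y)
coef(lm(y  ~ x))    # slope 2 on the original scale
coef(lm(yz ~ x))    # slope 2 / sd(y), roughly 0.46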

Michael M
  • So, you mean we should only standardize the covariates and not the response variable, right? That makes sense, thank you. – user2521204 Jan 09 '22 at 17:43
  • Not all covariates. But those involving interactions and polynomial terms are candidates for centering. – Michael M Jan 09 '22 at 17:45

If you transform the target variable, you would need to back-transform the predictions to get the predicted values on the appropriate scale. Since linear regression is a linear model, in your example this is the same as transforming the $\beta$ parameter.

> x <- c(-3, -2, -1, 0, 1, 2, 3)
> y <- 2 * x
> xz <- x / sd(x)  # mean(x) is 0, so this is the full standardization
> yz <- y / sd(y)  # same for y
> lm(yz ~ xz)

Call:
lm(formula = yz ~ xz)

Coefficients:
(Intercept)           xz  
          0            1  

> predict(lm(yz ~ xz)) * sd(y)  # back-transform the predictions to the original scale of y
 1  2  3  4  5  6  7 
-6 -4 -2  0  2  4  6
> coef(lm(yz ~ xz))[2] * sd(y) * xz  # equivalently, back-transform via the slope
[1] -6 -4 -2  0  2  4  6

So nothing is wrong: you scaled the data, and you got the correspondingly scaled parameters and predictions.

You don't need to standardize the data by default. There are scenarios where it helps or is required, e.g. when using polynomials, as mentioned in the linked post, when using regularization, or for some models other than linear regression.
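To see why regularization is one such scenario, here is a minimal base-R sketch (not part of the original answer; the ridge estimate is written out by hand rather than taken from a package). OLS coefficients simply rescale when a predictor's units change, but ridge coefficients do not, because the penalty depends on the scale of each column:

set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

ridge_coef <- function(X, y, lambda) {
  # closed-form ridge estimate on centered data (intercept left unpenalized)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc))
}

X1 <- cbind(x1, x2)          # predictors on their original scale
X2 <- cbind(x1 * 1000, x2)   # same information, x1 expressed in different units

coef(lm(y ~ X1))             # OLS: the slope for x1 is simply divided by 1000 below,
coef(lm(y ~ X2))             # and the fitted values are identical
ridge_coef(X1, y, lambda = 10)   # ridge: the rescaled column is penalized very differently,
ridge_coef(X2, y, lambda = 10)   # so the coefficients (and fits) are no longer equivalent

Standardizing the predictors first puts them on a common scale so the penalty treats them symmetrically, which is why penalized-regression software such as glmnet standardizes predictors internally by default.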

Tim