
I am trying to fit a quadratic model; my data are tuples (x, y).

The choices are:

1) lm(y~x+I(x^2))

2) lm(y~(x-mean(x))+I(x-mean(x))^2)

3) lm(y~(x-mean(x))+I(x^2 - mean(x^2)))

In other words, in 3) I am centering the quadratic term using its own mean.

I do understand that centering to reduce multicollinearity is not the issue here; I am just trying to understand how to center in general. Intuitively, 3) makes more sense: I am treating the linear and quadratic variables as separate and centering each in the usual way. 2) is odd because the quadratic term also contains a linear component once you expand the square. 1) and 3) give the same coefficients, which differ from 2)'s, and there seems to be no obvious relationship between the linear coefficients of 2) and 1). The quadratic coefficient is the same across all three models.
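These patterns are not specific to my data; they can be reproduced on simulated data. A sketch (Python/NumPy purely for illustration, with made-up numbers rather than my actual dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (x, y) pairs with a quadratic trend, standing in for
# the original (unspecified) data.
x = rng.uniform(50, 200, size=500)
y = (250 - 0.4 * (x - x.mean()) - 0.002 * (x - x.mean()) ** 2
     + rng.normal(0, 15, size=500))

def ols(*cols):
    """Least-squares fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones_like(x), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [intercept, linear, quadratic]

xc = x - x.mean()
b1 = ols(x, x**2)                    # model 1: raw x and x^2
b2 = ols(xc, xc**2)                  # model 2: center first, then square
b3 = ols(xc, x**2 - (x**2).mean())   # model 3: center each term separately

# The quadratic coefficients agree across all three parameterizations...
assert np.allclose(b1[2], b2[2]) and np.allclose(b1[2], b3[2])
# ...and models 1 and 3 share the same linear coefficient, unlike model 2.
assert np.allclose(b1[1], b3[1])
assert not np.isclose(b1[1], b2[1])
```

All three design matrices span the same column space, so the fitted curves are identical; only the parameterization, and hence the reported linear coefficient, differs.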

The outputs are

model 1)

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
    Min      1Q  Median      3Q     Max 
-73.845 -10.151   1.224   9.660  73.553 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) 262.709845  82.982956   3.166   0.0016 **
x             0.150473   1.346574   0.112   0.9111   
I(x^2)       -0.002182   0.005459  -0.400   0.6895   

model 2)

Call:
lm(formula = y ~ (x-mean(x)) + (x-mean(x))^2)

Residuals:
    Min      1Q  Median      3Q     Max 
-73.845 -10.151   1.224   9.660  73.553 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 247.263060   0.657972 375.796   <2e-16 ***
x -mean(x)    -0.396789   0.080544  -4.926    1e-06 ***
(x -mean(x))^2 -0.002182   0.005459  -0.400     0.69    

And model 3)

Call:
lm(formula = y ~ (x - mean(x)) + I(x^2 - mean(x^2)))

Residuals:
    Min      1Q  Median      3Q     Max 
-73.845 -10.151   1.224   9.660  73.553 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        247.138199   0.579052 426.798   <2e-16 ***
x - mean(x)         0.150473   1.346574   0.112    0.911    
I(x^2 - mean(x^2))  -0.002182   0.005459  -0.400    0.690    

Notice that 1) and 3) give the same coefficient estimates, while 2) differs in the coefficient on the linear term. The coefficients on the quadratic term all agree. Model 2's linear term is significant while the other models' is not. Why?

gbh.
  • Could you elaborate on what you mean by "center in general"? What are you hoping to achieve? If I might venture to guess, you could be interested in orthogonal polynomials--but it's hard to tell. – whuber Apr 08 '15 at 18:58
  • I am not trying to achieve anything fancy actually. When I say center in general, I mean subtracting a mean from an independent variable. If I disregard for a second that my second var is the square of the first, the problem boils down to linearly regressing y on two variables and standard centering would get me model 3. – gbh. Apr 08 '15 at 19:36
  • I guess the question is which is correct, model 2 or 3; they yield different measures of significance for the linear and the quadratic terms. – gbh. Apr 08 '15 at 19:53
  • For instance this article seems to suggest that you should center first and then build the model at the second step. http://rtutorialseries.blogspot.com/2010/02/r-tutorial-series-basic-polynomial.html. This could also lead to biased estimates though. – gbh. Apr 08 '15 at 21:19
  • The three models are equally "correct." What differs are what hypotheses they automatically test about the coefficients. Perhaps, then, you are trying to ask about how to test hypotheses about coefficients in polynomial regressions? – whuber Apr 08 '15 at 21:21
  • How do the hypotheses differ? Please elaborate. Again, my goal is just to fit a quadratic and understand the significance of the final relationship. I do understand that the final estimate of y is the same for all three models, but the significance changes, as I have shown in the outputs. Thanks a lot! – gbh. Apr 08 '15 at 21:27
  • Yes, I am trying to ask that: how to do it, and how to get around the fact that sometimes the linear term is significant and sometimes it is not, as shown above. – gbh. Apr 08 '15 at 21:33
  • I would think models 1 and 3 are closer. The quadratic term in model 2 is more like a U shape than the intended accelerating or decelerating curve, because squaring a negative number yields a positive result. – Penguin_Knight Apr 08 '15 at 21:49

1 Answer


When you fit a regression model with a single variable and its squared effect, the interpretation of the coefficient on the linear term changes: it is the instantaneous slope of the parabola at the point where the (possibly centered) linear term is zero. It is therefore easy to see how Models 1 and 2 differ: Model 1 gives you the slope of the tangent line to the parabola at $x = 0$, whereas Model 2 gives the slope of the tangent at the mean of $x$. For Model 3, note first that $x^2 - \overline{x^2}$ differs from $x^2 - \bar{x}^2$ only by a constant, which is absorbed into the intercept and does not affect the slopes; with some algebra you can see what you are fitting:

$$ \begin{eqnarray} y &=& a + b(x-\bar{x}) + c(x^2 - \bar{x}^2)\\ &=& a + b(x-\bar{x}) + c(x-\bar{x})^2 + 2c\bar{x}(x - \bar{x})\\ &=& a + (b+2c\bar{x})(x-\bar{x}) + c(x-\bar{x})^2 \\ \end{eqnarray} $$

So Models 2 and 3 fit the same curve, but the linear coefficient Model 2 reports is $b + 2c\bar{x}$, the slope at $\bar{x}$, while the $b$ that Model 3 reports is the slope at $x = 0$, which is why it matches Model 1's linear coefficient.
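The relationship between the two parameterizations can be checked numerically; a sketch with simulated data (Python/NumPy used for illustration, since the identity is plain linear algebra and holds for any sample):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data; the coefficient identity holds for any (x, y) sample.
x = rng.uniform(50, 200, size=400)
y = 3.0 + 0.1 * x - 0.002 * x**2 + rng.normal(0, 5, size=400)

def fit(*cols):
    """OLS fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones_like(x), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

xc = x - x.mean()
_, b, c = fit(xc, x**2 - (x**2).mean())   # Model 3: coefficients a, b, c
_, b2, c2 = fit(xc, xc**2)                # Model 2

# The quadratic coefficients agree, and Model 2's linear coefficient
# equals b + 2*c*mean(x): the slope of the fitted parabola at mean(x).
assert np.allclose(c, c2)
assert np.allclose(b2, b + 2 * c * x.mean())
```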

AdamO
  • Hmm, thanks. I would say 1) and 3) are the similar ones, given the coefficients, and 2) is the different one. – gbh. Apr 08 '15 at 22:46
  • It's hard to discern at this point whether the similarities you observe are an artifact of the data generation. I have only quoted what is supported by the theory in general. Linear effects are instantaneous change at the intercept when controlling for the higher quadratic trend. – AdamO Apr 08 '15 at 22:52