I'm not sure whether to ask this here or on Stack OverFlow, but I think here is better because the question is based more around the theoretical aspects than specific code.
I'm currently working through Applied Linear Statistical Models - Kutner 5th ed and I'm on the chapter discussing polynomial regressions. In the chapter the model for a one predictor variable raised to the second power polynomial is given as
$$Y_{i} = \beta_{0} + \beta_{1}x_{i} + \beta_{2}x_{i}^{2} + \epsilon_{i}$$
where $x_{i} = X_{i} - \bar{X}$ and $X_{i}$ is the original data point before "centring".
In a quest to have a more thorough understanding of things I tend to do a set of exercises "manually" first and then exercises asking for similar things but this time leveraging the functions that are in R directly. In some cases I use the functions in R to verify if I'm on the right track.
A question asked me to fit a second order regression model to a data set. I attempted to do this by using the poly()
function in R along with the respective lm()
function needed. Fortunately I have solutions to exercises and the solution was different to what I obtained. So I went back and did the calculation manually with the expression as above and obtained the correct coefficients.
I wanted to know why there was a difference? I was under the impression that the poly()
function is "centring" the data points in the same way I'd do it manually and that is serving to orthogonalize my polynomial in the necessary way in order for multicollinearity to disappear and obtain the correct coefficients. But I suppose my understanding is flawed.
Below is a reproducible example with commentary of what I did at each step:
#dataframe for reproduction
dput(reprod_example)
structure(list(muscle_mass = c(106, 106, 97, 113, 96, 119, 92,
112, 92, 102), age = c(43, 41, 47, 46, 45, 41, 47, 41, 48, 48
), age_centred = c(-1.7, -3.7, 2.3, 1.3, 0.299999999999997, -3.7,
2.3, -3.7, 3.3, 3.3), age_centred_sq = c(2.89000000000001, 13.69,
5.28999999999999, 1.68999999999999, 0.0899999999999983, 13.69,
5.28999999999999, 13.69, 10.89, 10.89)), row.names = c(NA, -10L
), class = c("tbl_df", "tbl", "data.frame"))
#manually creating the transformed data points to use in the regression and put in my data frame
age_centred_reprod = with(data = reprod_example, age - mean(age))
reprod_example$age_centred = age_centred_reprod
reprod_example$age_centred_sq = age_centred_reprod^2
#created polynomial regression functions. First one is from "manual" creation of model,
the other two are applying poly() function to lm(). One with raw = TRUE and one with raw = FALSE respectively
reprod_example_deg_2_manual_test = lm(data = reprod_example, formula = muscle_mass ~ age_centred + age_centred_sq)
reprod_example_lm_deg_2_raw = lm(data = reprod_example, formula = muscle_mass ~ poly(x = age, degree = 2, raw = TRUE))
reprod_example_lm_deg_2_orthog = lm(data = reprod_example, formula = muscle_mass ~ poly(x = age, degree = 2, raw = FALSE))
#the coefficient estimates from the 3 regression models. The "manual" model is
the one that produced the "correct" values with respect to the textbook exercise
> brief(reprod_example_deg_2_manual_test)
(Intercept) age_centred age_centred_sq
Estimate 102.89 -2.202 0.0776
Std. Error 4.58 0.916 0.5052
Residual SD = 7.37 on 7 df, R-squared = 0.513
> brief(reprod_example_lm_deg_2_raw)
(Intercept) poly(x = age, degree = 2, raw = TRUE)1 poly(x = age, degree = 2, raw = TRUE)2
Estimate 356 -9.14 0.0776
Std. Error 989 44.79 0.5052
Residual SD = 7.37 on 7 df, R-squared = 0.513
> brief(reprod_example_lm_deg_2_orthog)
(Intercept) poly(x = age, degree = 2, raw = FALSE)1 poly(x = age, degree = 2, raw = FALSE)2
Estimate 103.50 -19.97 1.13
Std. Error 2.33 7.37 7.37
Residual SD = 7.37 on 7 df, R-squared = 0.513
I do notice that the regression coefficients between my "manual" model and the orthogonalized polynomial model are somewhat similar. And there is most likely a reason for that. But I haven't acquired the knowledge to fully understand why that is the case. I was expecting those two sets of regression coefficients to be the same, but alas it did not turn out to be the case.
Would somebody be able to explain to me what it is that's going on?