One option for polynomial linear regression is to transform a covariate x into a matrix of its polynomial terms up to degree n and then fit a linear regression model. The function poly in R either orthogonalizes the columns of this matrix (the default) or leaves them as the raw powers (raw = TRUE). I observed that choosing the orthogonal option may lead to problems with predictions on new data. However, orthogonalizing the columns seems important, since linear regression suffers when covariates are highly correlated (as do other methods such as Ridge regression).
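To see the collinearity point concretely, here is a minimal standalone sketch (the variable z and the seed are just for illustration, separate from the simulation below):

set.seed(1)
z = rnorm(100)
round(cor(poly(z, 3, raw = TRUE)), 2) # raw powers: cor(z, z^3) is around 0.8
round(cor(poly(z, 3)), 2)             # orthogonalized: identity matrix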
An illustration
I simulate data from a third-order polynomial model (black points). Then I fit a linear model using the orthogonal transformation from poly and predict on the same data (purple points). Next I choose new data x_new, run the orthogonal transformation on it, and predict using the same model. The predictions differ (blue points)!
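My suspicion (an assumption on my part) is that this happens because poly computes its centering and scaling constants from whatever data it is given and stores them in a "coefs" attribute, so poly(x, 3) and poly(x_new, 3) build different bases. With x and x_new as defined in the code below:

str(attr(poly(x, 3), "coefs"))     # list with alpha and norm2, computed from x
str(attr(poly(x_new, 3), "coefs")) # different values, computed from x_new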
As noted by @hdx1011, using the poly function inside the lm formula avoids the problem (a sketch of that version follows the code below), so the function internally has a way to deal with it. However, in my application I need to implement the matrix product explicitly, as in the code below.
So how can I implement this matrix product (x_mat %*% as.matrix(coef(mod1))), or re-scale the matrix poly(x_new, 3), to get the correct predictions?
Code.
set.seed(1)                           # for reproducibility
n = 1000
x = rnorm(n, 0, 1)
y = 1 + x + x^2 + x^3 + rnorm(n)      # third-order polynomial model
x_mat = poly(x, 3)                    # orthogonalized polynomial basis of x
mod1 = lm(y ~ x_mat)
x_mat = cbind(rep(1, n), x_mat)       # prepend the intercept column
pred = x_mat %*% as.matrix(coef(mod1))          # manual predictions on x
x_new = seq(-3, 3, length.out = 600)
x_mat_new = cbind(rep(1, 600), poly(x_new, 3))  # basis recomputed from x_new
pred_new = x_mat_new %*% as.matrix(coef(mod1))  # manual predictions on x_new
plot(x, y)
points(x, pred, col = "purple")       # matches the fit
points(x_new, pred_new, col = "blue") # these are off!
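For comparison, and as @hdx1011 pointed out, here is a sketch of the version that works, calling poly inside the formula so that predict can reuse the stored constants (mod2 and pred_ok are just names I chose):

mod2 = lm(y ~ poly(x, 3))
pred_ok = predict(mod2, newdata = data.frame(x = x_new))
points(x_new, pred_ok, col = "red") # these do fall on the fitted curve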