I'm developing a nonlinear response correction for a sensor (to transform "raw.peak" to "target"). I don't care about interpretability. I do care about future accuracy.
One might first just throw it straight into least squares, in R:
> fit.lm <- lm(target ~ poly(raw.peak, 20), var500, subset = train)
> summary(fit.lm)
Call:
lm(formula = target ~ poly(raw.peak, 20), data = var500, subset = train)
Residuals:
Min 1Q Median 3Q Max
-742.7 -58.0 -9.7 55.8 11661.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17650.218 3.508 5031.906 < 2e-16 ***
poly(raw.peak, 20)1 427439.542 289.378 1477.096 < 2e-16 ***
poly(raw.peak, 20)2 -37413.844 292.837 -127.763 < 2e-16 ***
poly(raw.peak, 20)3 -39947.963 291.868 -136.870 < 2e-16 ***
poly(raw.peak, 20)4 -15680.699 288.243 -54.401 < 2e-16 ***
poly(raw.peak, 20)5 -28.481 290.655 -0.098 0.921946
poly(raw.peak, 20)6 6362.216 290.386 21.910 < 2e-16 ***
poly(raw.peak, 20)7 3447.940 288.974 11.932 < 2e-16 ***
poly(raw.peak, 20)8 -1472.930 295.174 -4.990 6.30e-07 ***
poly(raw.peak, 20)9 -4381.197 291.898 -15.009 < 2e-16 ***
poly(raw.peak, 20)10 -2468.288 280.874 -8.788 < 2e-16 ***
poly(raw.peak, 20)11 1436.665 291.826 4.923 8.87e-07 ***
poly(raw.peak, 20)12 3072.431 291.870 10.527 < 2e-16 ***
poly(raw.peak, 20)13 1480.074 281.964 5.249 1.61e-07 ***
poly(raw.peak, 20)14 106.258 288.824 0.368 0.712967
poly(raw.peak, 20)15 102.198 297.868 0.343 0.731543
poly(raw.peak, 20)16 1009.355 288.859 3.494 0.000481 ***
poly(raw.peak, 20)17 1670.275 289.112 5.777 8.18e-09 ***
poly(raw.peak, 20)18 1136.865 295.144 3.852 0.000119 ***
poly(raw.peak, 20)19 73.071 292.661 0.250 0.802849
poly(raw.peak, 20)20 -785.794 293.245 -2.680 0.007400 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
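Since accuracy is the only criterion here, a quick held-out check of this fit is worth two lines. A sketch, assuming `var500`, `fit.lm` and the `train` vector used in `subset =` above are in the workspace and that `train` holds row indices:

```r
# Held-out MSE of the degree-20 OLS fit (assumes var500, fit.lm and an
# index vector train from above are in the workspace)
test <- setdiff(seq_len(nrow(var500)), train)
mean((var500$target[test] - predict(fit.lm, var500[test, ]))^2)
```

(`predict` handles `poly()` terms correctly on new data, since the orthogonal-polynomial basis is stored with the fit.)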
(Of course you would do proper model selection, tracking held-out MSE as you consider higher powers, and you'd end up with a model that uses all orders up to some k.) Judging by significance alone, we might wonder whether to consider even higher powers than 20! And then there are a few gaps: the 5th, 14th, 15th and 19th orders come out non-significant. Let's put that aside for a moment, bearing in mind we are concerned only with accuracy. Let's use the lasso here. First, we find the best value of the tuning parameter by cross-validation:
> library(glmnet)
> y <- var500$target
> x <- model.matrix(target ~ poly(raw.peak, 20), var500)[, -1]
> lasso.mod <- cv.glmnet(x[train, ], y[train], alpha = 1)
and then we can look at the resultant coefficients:
> predict(lasso.mod, type = "coefficients", s = lasso.mod$lambda.min)
21 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 17649.9905
poly(raw.peak, 20)1 424714.3867
poly(raw.peak, 20)2 -34375.9387
poly(raw.peak, 20)3 -36838.2749
poly(raw.peak, 20)4 -12662.3833
poly(raw.peak, 20)5 .
poly(raw.peak, 20)6 3645.3184
poly(raw.peak, 20)7 1020.6148
poly(raw.peak, 20)8 .
poly(raw.peak, 20)9 -930.5491
poly(raw.peak, 20)10 .
poly(raw.peak, 20)11 .
poly(raw.peak, 20)12 .
poly(raw.peak, 20)13 .
poly(raw.peak, 20)14 .
poly(raw.peak, 20)15 .
poly(raw.peak, 20)16 .
poly(raw.peak, 20)17 .
poly(raw.peak, 20)18 .
poly(raw.peak, 20)19 .
poly(raw.peak, 20)20 .
This seems much more like it! With this value of the tuning parameter, I get to stop at 9th order. This model, however, also sets the coefficients for the 5th and 8th orders to zero. I'm okay with this. The nub of my question is: "Is there a reason why I wouldn't be okay with this?" If I wanted to use this result for inference (as you do with high-degree polynomials ;-) ), then I would be interested in responses that talk about including lower-order terms/interactions.
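Given that future accuracy is the stated goal, the natural arbiter is held-out error. A sketch of the comparison, under the same assumption that `train` is an index vector into `var500`:

```r
# Compare the sparse lasso fit against the full 20-term OLS fit on the
# held-out rows (assumes x, y, fit.lm, lasso.mod and train from above)
test <- setdiff(seq_len(nrow(var500)), train)
pred.lasso <- predict(lasso.mod, newx = x[test, ], s = lasso.mod$lambda.min)
pred.ols   <- predict(fit.lm, var500[test, ])
mean((y[test] - pred.lasso)^2)  # lasso test MSE
mean((y[test] - pred.ols)^2)    # full OLS test MSE
```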
My first attempt at asking about this (here) got marked as a duplicate by gung, not unreasonably, but I think mostly because I concentrated on the linear/quadratic example and the responses were mostly aimed at interactions and interpretability. A more relevant previous question, I believe, is this one, where gung (again) rightly points out:
The short answer is that by not including certain terms in the model, you force parts of it to be exactly zero. This imposes an inflexibility to your model that necessarily causes bias, unless those parameters are exactly zero in reality; the situation is analogous to suppressing the intercept (which you can see discussed here).
This is perfectly true, of course, but the point of the lasso, and of cross-validation, is that you trade an increase in bias for a decrease in variance to get the lowest expected error. To stress again: I'm not concerned with interpretability, and I'm not asking about main effects becoming insignificant when you add interactions; I'm purely concerned with future accuracy. If I were putting this to a vote, I'd offer the choices "This approach is fine" and "The lasso doesn't know that your model matrix columns are successive powers of the same variable, so you should apply the principle of ... and refit using all powers up to your indicated highest".
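For concreteness, that second option amounts to refitting by OLS with every power up to the highest one the lasso retained (9th here), and again letting held-out MSE decide between the two. A sketch under the same assumptions about `var500` and `train`:

```r
# Hierarchical alternative: keep all powers up to the lasso's highest
# retained order (9), refit by OLS, and score on the held-out rows
fit.h <- lm(target ~ poly(raw.peak, 9), var500, subset = train)
test  <- setdiff(seq_len(nrow(var500)), train)
mean((var500$target[test] - predict(fit.h, var500[test, ]))^2)
```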