I'm developing a nonlinear response correction for a sensor (to transform "raw.peak" to "target"). I don't care about interpretability. I do care about future accuracy.
One might first just throw it straight into least squares, in R:
> fit.lm <- lm(target ~ poly(raw.peak, 20), var500, subset = train)
> summary(fit.lm)
Call:
lm(formula = target ~ poly(raw.peak, 20), data = var500, subset = train)
Residuals:
Min 1Q Median 3Q Max
-742.7 -58.0 -9.7 55.8 11661.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17650.218 3.508 5031.906 < 2e-16 ***
poly(raw.peak, 20)1 427439.542 289.378 1477.096 < 2e-16 ***
poly(raw.peak, 20)2 -37413.844 292.837 -127.763 < 2e-16 ***
poly(raw.peak, 20)3 -39947.963 291.868 -136.870 < 2e-16 ***
poly(raw.peak, 20)4 -15680.699 288.243 -54.401 < 2e-16 ***
poly(raw.peak, 20)5 -28.481 290.655 -0.098 0.921946
poly(raw.peak, 20)6 6362.216 290.386 21.910 < 2e-16 ***
poly(raw.peak, 20)7 3447.940 288.974 11.932 < 2e-16 ***
poly(raw.peak, 20)8 -1472.930 295.174 -4.990 6.30e-07 ***
poly(raw.peak, 20)9 -4381.197 291.898 -15.009 < 2e-16 ***
poly(raw.peak, 20)10 -2468.288 280.874 -8.788 < 2e-16 ***
poly(raw.peak, 20)11 1436.665 291.826 4.923 8.87e-07 ***
poly(raw.peak, 20)12 3072.431 291.870 10.527 < 2e-16 ***
poly(raw.peak, 20)13 1480.074 281.964 5.249 1.61e-07 ***
poly(raw.peak, 20)14 106.258 288.824 0.368 0.712967
poly(raw.peak, 20)15 102.198 297.868 0.343 0.731543
poly(raw.peak, 20)16 1009.355 288.859 3.494 0.000481 ***
poly(raw.peak, 20)17 1670.275 289.112 5.777 8.18e-09 ***
poly(raw.peak, 20)18 1136.865 295.144 3.852 0.000119 ***
poly(raw.peak, 20)19 73.071 292.661 0.250 0.802849
poly(raw.peak, 20)20 -785.794 293.245 -2.680 0.007400 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
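Since accuracy is the only criterion here, a quick held-out check of this fit is worth two lines. A sketch, assuming `var500`, `fit.lm` and the `train` vector used in `subset =` above are in the workspace and that `train` holds row indices:

```r
# Held-out MSE of the degree-20 OLS fit (assumes var500, fit.lm and an
# index vector train from above are in the workspace)
test <- setdiff(seq_len(nrow(var500)), train)
mean((var500$target[test] - predict(fit.lm, var500[test, ]))^2)
```

(`predict` handles `poly()` terms correctly on new data, since the orthogonal-polynomial basis is stored with the fit.)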
(Of course you would do proper model selection, tracking held-out MSE as you consider higher powers, and you'd end up with a model that uses all orders up to some k.) Judging by significance alone, we might wonder whether to consider even higher powers than 20! And then there are a few gaps: the 5th, 14th, 15th and 19th orders come out non-significant. Let's put that aside for a moment, bearing in mind we are concerned only with accuracy. Let's use the lasso here. First, we find the best value of the tuning parameter by cross-validation:
> library(glmnet)
> y <- var500$target
> x <- model.matrix(target ~ poly(raw.peak, 20), var500)[, -1]
> lasso.mod <- cv.glmnet(x[train, ], y[train], alpha = 1)
and then we can look at the resultant coefficients:
> predict(lasso.mod, type = "coefficients", s = lasso.mod$lambda.min)
21 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 17649.9905
poly(raw.peak, 20)1 424714.3867
poly(raw.peak, 20)2 -34375.9387
poly(raw.peak, 20)3 -36838.2749
poly(raw.peak, 20)4 -12662.3833
poly(raw.peak, 20)5 .
poly(raw.peak, 20)6 3645.3184
poly(raw.peak, 20)7 1020.6148
poly(raw.peak, 20)8 .
poly(raw.peak, 20)9 -930.5491
poly(raw.peak, 20)10 .
poly(raw.peak, 20)11 .
poly(raw.peak, 20)12 .
poly(raw.peak, 20)13 .
poly(raw.peak, 20)14 .
poly(raw.peak, 20)15 .
poly(raw.peak, 20)16 .
poly(raw.peak, 20)17 .
poly(raw.peak, 20)18 .
poly(raw.peak, 20)19 .
poly(raw.peak, 20)20 .
This seems much more like it! With this value of the tuning parameter, I get to stop at 9th order. This model, however, also sets the coefficients for the 5th and 8th orders to zero. I'm okay with this. The nub of my question is: "Is there a reason why I wouldn't be okay with this?" If I wanted to use this result for inference (as you do with high-degree polynomials ;-) ), then I would be interested in responses that talk about including lower-order terms/interactions.
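Given that future accuracy is the stated goal, the natural arbiter is held-out error. A sketch of the comparison, under the same assumption that `train` is an index vector into `var500`:

```r
# Compare the sparse lasso fit against the full 20-term OLS fit on the
# held-out rows (assumes x, y, fit.lm, lasso.mod and train from above)
test <- setdiff(seq_len(nrow(var500)), train)
pred.lasso <- predict(lasso.mod, newx = x[test, ], s = lasso.mod$lambda.min)
pred.ols   <- predict(fit.lm, var500[test, ])
mean((y[test] - pred.lasso)^2)  # lasso test MSE
mean((y[test] - pred.ols)^2)    # full OLS test MSE
```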
My first attempt at asking about this (here) got marked as a duplicate by gung, not unreasonably, but I think mostly because I concentrated on the linear/quadratic example and the responses were mostly aimed at interactions and interpretability. A more relevant previous question, I believe, is this one, where gung (again) rightly points out:
The short answer is that by not including certain terms in the model, you force parts of it to be exactly zero. This imposes an inflexibility to your model that necessarily causes bias, unless those parameters are exactly zero in reality; the situation is analogous to suppressing the intercept (which you can see discussed here).
This is perfectly true, of course, but the point of the lasso, and of cross-validation, is that you trade an increase in bias for a decrease in variance to get the lowest expected error. To stress again: I'm not concerned with interpretability, and I'm not asking about main effects becoming insignificant when you add interactions; I'm purely concerned with future accuracy. If I were putting this to a vote, I'd offer the choices "This approach is fine" and "The lasso doesn't know that your model matrix columns are successive powers of the same variable, so you should apply the principle of ... and refit using all powers up to your indicated highest".
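For concreteness, that second option amounts to refitting by OLS with every power up to the highest one the lasso retained (9th here), and again letting held-out MSE decide between the two. A sketch under the same assumptions about `var500` and `train`:

```r
# Hierarchical alternative: keep all powers up to the lasso's highest
# retained order (9), refit by OLS, and score on the held-out rows
fit.h <- lm(target ~ poly(raw.peak, 9), var500, subset = train)
test  <- setdiff(seq_len(nrow(var500)), train)
mean((var500$target[test] - predict(fit.h, var500[test, ]))^2)
```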