I am trying to fit the best multivariate polynomial to a dataset using stepAIC(). My problem is that I have more variables (p = 3003) than observations (n = 500), so when I run lm() on my data set I get NA coefficients, and when I use this model as the base model for stepAIC() I get an infinite AIC value.

Here is the structure of my data:

    > dput(head(RFSS_app1, 3))
    structure(list(RFSS.Scenario = c(350L, 447L, 785L),
        EConv_1 = c(-0.315512, -0.08557, -0.027885),
        FL_1 = c(0.084195258, -0.294956174, 0.37204179),
        FS_1 = c(0.099359281, -0.25734015, 0.460677036),
        FT_1 = c(-0.360652792, 0.195988967, 0.141802042),
        Inf_1 = c(-0.179564905, -0.02855574, 0.090620805),
        CCorp_1 = c(0.249221095, -0.242114791, 0.020778011),
        CSov_1 = c(0.173579203, -0.007359984, 0.125330282),
        MAnn_1 = c(-0.188385393, 0.339606463, 0.014542029),
        LExp_1 = c(0.493891494, 0.310671384, 0.44775971),
        X_1 = c(-0.059092292, -0.050541035, -0.087400051),
        Total = c(1330994.53499487, 760365.862286965, 151841.273114568),
        id = structure(c(1L, 1L, 1L), .Label = c("Y", "OOS", "SM"), class = "factor")),
        .Names = c("RFSS.Scenario", "EConv_1", "FL_1", "FS_1", "FT_1", "Inf_1",
        "CCorp_1", "CSov_1", "MAnn_1", "LExp_1", "X_1", "Total", "id"),
        row.names = c(3L, 4L, 7L), class = "data.frame")

Since I have 10 variables and want to run the regression on all possible combinations of them up to degree 5, I use poly() to generate the terms:

    > # regression formula; `names` holds the 10 base variable names
    > form <- as.formula(paste0("Total ~ poly(", paste0(names, "_1", collapse = ","), ", degree = 5, raw = TRUE)"))
    > test <- lm(form, subset = id == "Y", data = RFSS_app1)

I get coefficient estimates only for the first n terms; the remaining coefficients are NA.
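For reference, p = 3003 is exactly the number of monomials of degree at most 5 in 10 variables, intercept included, which a one-liner confirms:

    # C(k + d, d) counts the monomials of degree <= d in k variables
    choose(10 + 5, 5)
    #> [1] 3003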

I then run stepAIC() as follows:

    mod0 <- lm(Total ~ 1, subset = id == "Y", data = RFSS_app1)
    modselect_f <- stepAIC(mod0, scope = form, data = RFSS_app1[RFSS_app1$id == "Y", ],
                           trace = TRUE, direction = "forward")

and my result is:

    Start:  AIC=13651.51
    Total ~ 1

                                                                                                         Df  Sum of Sq        RSS   AIC
    + poly(EConv_1, FL_1, FS_1, FT_1, Inf_1, CCorp_1, CSov_1, MAnn_1, LExp_1, X_1, degree = 5, raw = T) 499 3.5874e+14 0.0000e+00  -Inf
    <none>                                                                                                             3.5874e+14 13652

    Step:  AIC=-Inf
    Total ~ poly(EConv_1, FL_1, FS_1, FT_1, Inf_1, CCorp_1, CSov_1, 
        MAnn_1, LExp_1, X_1, degree = 5, raw = T)

    Warning message:
    In stepAIC(mod0, form, data = RFSS_app1[RFSS_app1$id == "Y", ],  :
      bytecode version mismatch; using eval

I am wondering whether the issue with stepAIC() is that the terms generated by poly() aren't considered separately: the whole poly() expression is treated as a single term, so the first forward step adds all of its columns at once, the fit becomes saturated (Df = 499, RSS = 0), and the AIC is -Inf.
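One way to test this would be to expand the polynomial basis into separate, individually named columns before calling stepAIC(), so that each monomial can enter the model one at a time. A minimal sketch of what I mean (the object names `vars`, `dat`, and `X` and the column-renaming scheme are placeholders; forward selection over ~3000 candidate terms will also be slow):

    library(MASS)

    vars <- c("EConv_1", "FL_1", "FS_1", "FT_1", "Inf_1",
              "CCorp_1", "CSov_1", "MAnn_1", "LExp_1", "X_1")
    dat  <- subset(RFSS_app1, id == "Y")

    # expand every monomial of degree <= 5 into its own column
    X <- as.data.frame(do.call(polym,
           c(as.list(dat[, vars]), list(degree = 5, raw = TRUE))))
    names(X) <- paste0("p", gsub("\\.", "_", names(X)))  # syntactic names
    X$Total  <- dat$Total

    mod0  <- lm(Total ~ 1, data = X)
    upper <- reformulate(setdiff(names(X), "Total"), response = "Total")
    fit   <- stepAIC(mod0, scope = list(lower = ~ 1, upper = upper),
                     direction = "forward", trace = FALSE)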

Any thoughts on the subject? Also, is my regression considered high-dimensional because n < p, or not, since all the terms derive from the initial 10 variables?

Should I consider other methods such as ridge or lasso regression?

Thank you very much for your help.

Inas O
    Could you please say a bit more about what question you are trying to answer by fitting a high-order polynomial multiple-predictor model? I suspect that there might be more efficient and reliable ways to address the underlying question. Forward stepwise regression is frowned upon by many who frequent this site, and polynomial models can pose problems, too. – EdM Jul 05 '18 at 16:36
  • Thank you for your answer @EdM. My dependent variable _y_ is calculated from the explanatory variables and their interactions using heavy simulations. Say I need 10000 values of _y_, which is too time-consuming, so I am trying to fit the best polynomial form on a subset of _y_ (say 500 values) in order to predict the remaining 9500 values based on the obtained form. – Inas O Jul 06 '18 at 09:22

1 Answer


As I understand the situation, with your simulations you can get reasonably exact values of the outcome variable y for any given set of predictor values (10 predictors), but the computational cost is too high to do the simulations on any more than a small subset (say, 500) of the sets of predictor values of interest (say, 10000 sets of values of the 10 predictors). You are trying to estimate the values of y for the rest of the sets of predictor values with linear regression, involving interactions among all predictors up to the 5th power, and thus are running into difficulties with fewer cases than potential predictors (including all the interactions).

This is a problem in what is called multivariate interpolation, a field with a well developed set of approaches. One of these is Kriging. As noted here, this can be applied to the type of situation you describe, and in principle can be applied to any number of dimensions of predictors. The theoretical basis is explained on this page. Multidimensional smoothing splines and adaptive regression splines provide other approaches. Implementations in software packages are available, as described for example here and here.
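For instance, here is a minimal kriging sketch using the DiceKriging package (the train/test split via your id factor, the Matérn covariance, and the constant trend are assumptions on my part, not something your question specifies):

    library(DiceKriging)

    vars  <- c("EConv_1", "FL_1", "FS_1", "FT_1", "Inf_1",
               "CCorp_1", "CSov_1", "MAnn_1", "LExp_1", "X_1")
    train <- subset(RFSS_app1, id == "Y")    # the 500 simulated cases
    test  <- subset(RFSS_app1, id == "OOS")  # cases to interpolate (assumed)

    # Gaussian-process (kriging) model with a constant trend
    fit  <- km(design = train[, vars], response = train$Total,
               covtype = "matern5_2")
    pred <- predict(fit, newdata = test[, vars], type = "UK")
    head(pred$mean)   # BLUP of Total at the new predictor values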

The approach you proposed might be interpreted as an attempt to approximate y with a 5th-order Taylor approximation in the 10 predictor variables, while keeping only the terms that are "most important." One difficulty is that stepwise selection to find the "most important" predictors can be highly problematic. Furthermore, the penalized approaches you mention (LASSO, ridge regression) necessarily introduce a bias into the predictions, as part of the bias-variance tradeoff. Kriging, in contrast, provides the Best Linear Unbiased Prediction (BLUP) of a random field.
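If you nevertheless want to compare against the penalized route, a sketch with glmnet (reusing vars, train, and test from the kriging sketch above; alpha = 1 gives the LASSO, alpha = 0 ridge regression; raw = TRUE keeps the training and test bases identical):

    library(glmnet)

    # expand all monomials of degree <= 5 into numeric matrices
    Xtr <- do.call(polym, c(as.list(train[, vars]), list(degree = 5, raw = TRUE)))
    Xte <- do.call(polym, c(as.list(test[, vars]),  list(degree = 5, raw = TRUE)))

    cvfit <- cv.glmnet(as.matrix(Xtr), train$Total, alpha = 1)  # lambda by CV
    yhat  <- predict(cvfit, newx = as.matrix(Xte), s = "lambda.min")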

Finally, do not forget that interpolation of any type might be unreliable outside the subspace within which you are doing the interpolation. In a high-dimensional predictor space, it might be difficult to determine whether new sets of predictor values fit safely into that subspace. So, however you proceed, take caution to avoid hoped-for interpolations that are really extrapolations.
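As a crude screen (necessary but by no means sufficient, since a point can pass per-variable range checks yet still lie outside the convex hull of the training data), you could at least flag candidate points whose coordinates fall outside the training ranges; continuing with train, test, and vars from the sketches above:

    rng     <- sapply(train[, vars], range)  # 2 x 10 matrix: min/max per predictor
    outside <- apply(test[, vars], 1,
                     function(z) any(z < rng[1, ] | z > rng[2, ]))
    sum(outside)   # points that are certainly extrapolations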

EdM