I am trying to fit the best multivariate polynomial to a dataset using stepAIC(). My problem is that I have more variables (p=3003) than observations (n=500), so when I run lm() on my data set I get NAs for some coefficients, and when I use this model as the upper model for stepAIC() I get an AIC of -Inf.
Here is the structure of my data:
> dput(head(RFSS_app1,3))
structure(list(RFSS.Scenario = c(350L, 447L, 785L), EConv_1 = c(-0.315512, -0.08557, -0.027885), FL_1 = c(0.084195258, -0.294956174, 0.37204179
), FS_1 = c(0.099359281, -0.25734015, 0.460677036), FT_1 = c(-0.360652792,
0.195988967, 0.141802042), Inf_1 = c(-0.179564905, -0.02855574,
0.090620805), CCorp_1 = c(0.249221095, -0.242114791, 0.020778011
), CSov_1 = c(0.173579203, -0.007359984, 0.125330282), MAnn_1 = c(-0.188385393,
0.339606463, 0.014542029), LExp_1 = c(0.493891494, 0.310671384,
0.44775971), X_1 = c(-0.059092292, -0.050541035, -0.087400051
), Total = c(1330994.53499487, 760365.862286965, 151841.273114568
), id = structure(c(1L, 1L, 1L), .Label = c("Y", "OOS", "SM"), class = "factor")), .Names = c("RFSS.Scenario",
"EConv_1", "FL_1", "FS_1", "FT_1", "Inf_1", "CCorp_1", "CSov_1",
"MAnn_1", "LExp_1", "X_1", "Total", "id"), row.names = c(3L,
4L, 7L), class = "data.frame")
Since I have 10 variables and I want to regress on all their products and powers up to total degree 5, I use poly() to generate the terms:
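> # regression formula; `names` holds the variable stems (the column names without the "_1" suffix)
> form <- as.formula(paste0("Total ~ poly(", paste0(names, "_1", collapse = ","), ", degree=5, raw=T)"))
> # fit on the rows with id == "Y" only
> test <- lm(form, data = RFSS_app1, subset = id == "Y")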
I get coefficient estimates only for roughly the first n terms; the remaining coefficients are NA.
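To make the dimension count concrete, here is a minimal sketch of where p=3003 comes from. The names `n_vars`, `max_degree`, and the small simulated matrix `x` are illustrative assumptions, not part of my data. The count of monomials of total degree at most d in k variables is choose(k + d, d), which includes the constant term, so poly() itself should produce one column fewer.

```r
# Number of monomials of total degree <= max_degree in n_vars variables,
# including the constant term (the intercept).
n_vars <- 10
max_degree <- 5
n_monomials <- choose(n_vars + max_degree, max_degree)
n_monomials          # 3003
n_poly_columns <- n_monomials - 1   # columns generated by poly(), no constant
n_poly_columns       # 3002

# Cross-check the formula on a small simulated case: 2 variables, degree 3.
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
small <- poly(x[, 1], x[, 2], degree = 3, raw = TRUE)
ncol(small)          # choose(2 + 3, 3) - 1 = 9
```

With only 500 observations, at most 500 of these columns (including the intercept) can be linearly independent, which would be consistent with the NA coefficients I see.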
I then run stepAIC() as follows:
mod0 <- lm(Total ~ 1, data = RFSS_app1, subset = id == "Y")
modselect_f <- stepAIC(mod0, form, data = RFSS_app1[RFSS_app1$id == "Y", ], trace = TRUE, direction = "forward")
and my result is:
Start: AIC=13651.51
Total ~ 1
Df Sum of Sq RSS AIC
+ poly(EConv_1, FL_1, FS_1, FT_1, Inf_1, CCorp_1, CSov_1, MAnn_1, LExp_1, X_1, degree = 5, raw = T) 499 3.5874e+14 0.0000e+00 -Inf
<none> 3.5874e+14 13652
Step: AIC=-Inf
Total ~ poly(EConv_1, FL_1, FS_1, FT_1, Inf_1, CCorp_1, CSov_1,
MAnn_1, LExp_1, X_1, degree = 5, raw = T)
Warning message:
In stepAIC(mod0, form, data = RFSS_app1[RFSS_app1$id == "Y", ], :
bytecode version mismatch; using eval
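I am wondering whether the issue with stepAIC() is that the terms generated by poly() aren't considered separately: the output above shows the whole poly(...) expression being added in a single step with 499 degrees of freedom.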
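A small check of this suspicion on simulated data (a sketch only: `toy`, `toy2`, and the column names `t1`..`t5` are made up for illustration and are not my RFSS_app1 data). It compares the term labels R sees when poly() sits inside the formula with those it sees when the monomials are expanded into separate columns first, and then runs a forward search over the expanded columns.

```r
# Toy data (simulated; not the RFSS_app1 data set).
set.seed(1)
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
toy$y <- 1 + 2 * toy$x1 - toy$x2^2 + rnorm(30, sd = 0.1)

# 1) With poly() inside the formula, the model has a single term label,
#    so a stepwise search can only add or drop the whole block.
form_poly <- y ~ poly(x1, x2, degree = 2, raw = TRUE)
labels_poly <- attr(terms(form_poly), "term.labels")
length(labels_poly)       # 1

# 2) Expand the monomials into separate, syntactically valid columns.
P <- poly(toy$x1, toy$x2, degree = 2, raw = TRUE)
expanded <- as.data.frame(matrix(as.numeric(P), nrow = nrow(P)))
names(expanded) <- paste0("t", seq_len(ncol(expanded)))
toy2 <- cbind(y = toy$y, expanded)

form_expanded <- reformulate(names(expanded), response = "y")
labels_expanded <- attr(terms(form_expanded), "term.labels")
length(labels_expanded)   # 5

# Forward selection can now consider each monomial on its own.
null_fit <- lm(y ~ 1, data = toy2)
fwd <- MASS::stepAIC(null_fit,
                     scope = list(lower = ~1, upper = form_expanded),
                     direction = "forward", trace = FALSE)
selected <- attr(terms(fwd), "term.labels")
selected
```

If the same holds for my real formula, expanding the poly() columns before calling stepAIC() might let the search add monomials one at a time, although with about 3000 candidates I am not sure a forward search is still a sensible approach.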
Any thoughts on this? Also, is my regression considered high-dimensional because n < p, or not, because all the terms are derived from the same 10 initial variables?
Should I consider other methods such as ridge or lasso regression?
Thank you for your help.