I'm currently working on a dataset and I'm using the AIC criterion with the function step
in R to achieve variable selection. Doing so, my model has reduced from 48 variables to 24 variables:
Residuals:
Min 1Q Median 3Q Max
-2576.8 -190.4 -93.8 33.4 15548.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 270.472 16.482 16.410 < 2e-16 ***
order 382.376 33.294 11.485 < 2e-16 ***
report 232.535 25.863 8.991 < 2e-16 ***
`000` 152.409 27.783 5.486 4.36e-08 ***
font 59.870 8.333 7.185 7.94e-13 ***
re -33.898 8.701 -3.896 9.95e-05 ***
receive 248.213 51.176 4.850 1.28e-06 ***
people 113.395 29.255 3.876 0.000108 ***
you -25.046 5.503 -4.551 5.48e-06 ***
george -17.225 5.100 -3.377 0.000738 ***
money 66.370 21.565 3.078 0.002100 **
addresses 131.646 40.453 3.254 0.001146 **
remove -77.934 23.188 -3.361 0.000784 ***
`650` -34.678 19.511 -1.777 0.075575 .
meeting -34.344 11.319 -3.034 0.002427 **
make 77.576 30.773 2.521 0.011743 *
project -30.727 13.822 -2.223 0.026266 *
edu -20.688 9.757 -2.120 0.034028 *
pm -36.348 20.704 -1.756 0.079223 .
`1999` 51.903 23.057 2.251 0.024430 *
original -70.763 41.520 -1.704 0.088396 .
labs -39.932 23.732 -1.683 0.092524 .
mail 23.066 13.974 1.651 0.098894 .
your -12.496 8.503 -1.470 0.141754
free -16.432 11.457 -1.434 0.151583
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 575.9 on 4185 degrees of freedom
Multiple R-squared: 0.1385, Adjusted R-squared: 0.1336
F-statistic: 28.04 on 24 and 4185 DF, p-value: < 2.2e-16
Now, the variables free
and your
don't seem to be significant. To decide whether to keep them in my model or not, I have compared the models with/without your
and with/without free
repeating 20 times a 10-fold cross validation.
Here's my code:
---------testing if "your" is useful to the model-----------------
me=0
mf=0
for (i in c(1:20)) {
e<-cv.lm(model, fit3, m=10,printit=FALSE,seed=i) #fit3 is the model without "your"
f<-cv.lm(model, fit, m=10,printit=FALSE,seed=i) #fit is the unrestricted model
me=me+attr(e,"ms")
mf=mf+attr(f,"ms")
}
me=me/20 #result : 334718
mf=mf/20 #result : 334659 ==> we keep "your"
---------testing if "free" is useful to the model-----------------
me=0
mf=0
for (i in c(1:20)) {
e<-cv.lm(model, fit2, m=10,printit=FALSE,seed=i) #fit2 is the model without "free"
f<-cv.lm(model, fit, m=10,printit=FALSE,seed=i) #fit is the unrestricted model
me=me+attr(e,"ms")
mf=mf+attr(f,"ms")
}
me=me/20 #result : 334726
mf=mf/20 #result : 334659 ==> we keep "free"
Now my questions are the following:
- Is that a good way to proceed? (Meaning: should I use CV to determine whether to keep those variables in my model or not?)
- Or should I perform some kind of hypothesis test to assess the relevance of those two variables?