I have yet to find a sufficient, succinct answer about model building with 10-fold cross-validation (in this case, using caret). I've found related answers, for instance here: https://stackoverflow.com/questions/33470373/applying-k-fold-cross-validation-model-using-caret-package, and here: How to choose a predictive model after k-fold cross-validation?
Would one build a model with all of the data, using R^2 (in this case, I'm just doing multiple regression) and likelihood ratio tests to determine the best (most parsimonious) model, and then run a 10-fold cross-validation on that model? I've done a train/test split before, where one builds the model on, say, 70% of the data and then obtains an RMSE by comparing the model's predictions on the held-out test set with the actual observations. It isn't immediately clear to me how this translates to something like 10-fold cross-validation, or whether, in the 10-fold case, one would again build the model with the entire dataset versus a subset of it.
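For concreteness, this is roughly what my train/test approach has looked like (a minimal sketch; mydat, resp, and the predictors x1 and x2 are placeholder names):

set.seed(123)
# put ~70% of rows into a training set, the rest into a test set
train_idx <- sample(seq_len(nrow(mydat)), size = floor(0.7 * nrow(mydat)))
train_dat <- mydat[train_idx, ]
test_dat  <- mydat[-train_idx, ]

fit   <- lm(resp ~ x1 + x2, data = train_dat)      # candidate multiple regression model
preds <- predict(fit, newdata = test_dat)          # predictions on the held-out 30%
rmse  <- sqrt(mean((test_dat$resp - preds)^2))     # compare predictions with actual observations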
After determining the best model, would one then run it through a package like caret:
library(caret)

# set up 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
model <- train(resp ~ (all variables included in final model), data = mydat,
               trControl = train_control,
               method = "rpart")  # "rpart" fits a decision tree; "lm" would fit a linear model
and then just obtain an RMSE as usual, by comparing model$pred with the actual values?
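In other words, is this roughly the idea? (A sketch under my assumptions: I'm guessing that savePredictions is how caret keeps the fold-level held-out predictions, and using method = "lm" since I'm doing multiple regression.)

train_control <- trainControl(method = "cv", number = 10, savePredictions = "final")
model <- train(resp ~ (all variables included in final model), data = mydat,
               trControl = train_control, method = "lm")

model$results                                     # caret's RMSE/Rsquared averaged across folds
head(model$pred)                                  # held-out predictions and observations per fold
sqrt(mean((model$pred$obs - model$pred$pred)^2))  # RMSE computed from those held-out predictions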
In this particular instance, I have a final sample of 350 participants from the 2018 Camp Fire wildfire disaster in California. I'm looking at factors like emotional support, family support, services, etc. (roughly 12 variables of interest, ~6 of which remain if I use all of the data) as predictors of well-being. Things like ethnicity, gender, and group (5, 2, and 2 levels, respectively) are included in preliminary models, but none explains much variance. I've also looked at some interactions (marriage by SES), but those also don't show much significance. There is only one time point.
With regard to a parsimonious model being "the best," my understanding is simply that, by testing model significance via LRTs, one tries to strike a balance between bias and variance as variables are added, and is therefore less likely to over- or under-fit the model. In that way, I believe one can also best balance an explanatory model against a predictive one. I'm certainly not steadfast in that; it's only what I've gathered from experience.
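To be clear about what I mean by LRT-based comparison, this is the sort of thing I've been doing (a sketch; the variable names are placeholders based on my predictors):

fit_small <- lm(resp ~ emotional_support + family_support, data = mydat)
fit_large <- lm(resp ~ emotional_support + family_support + services, data = mydat)

# compare nested models: anova() gives the F-test version,
# lmtest::lrtest(fit_small, fit_large) gives the chi-squared likelihood ratio test
anova(fit_small, fit_large)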
Thanks much!