I have yet to find a sufficient, succinct answer about model building with 10-fold cross-validation (in this case, using caret). I've found related answers, for instance here: https://stackoverflow.com/questions/33470373/applying-k-fold-cross-validation-model-using-caret-package, and here: How to choose a predictive model after k-fold cross-validation?
Would one build a model with all of the data, using R^2 (in this case, I'm just doing multiple regression) and likelihood ratio tests to determine the best (most parsimonious) model, and then run a 10-fold cross-validation on that model? I've done a train/test split before, where one builds the model on, say, 70% of the data and then obtains an RMSE by comparing the model's predictions on the held-out test set with the actual observations. It isn't immediately clear to me how this translates to something like 10-fold cross-validation, or whether, in the 10-fold case, one would again build the model with the entire dataset versus a subset of it.
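For concreteness, this is roughly what my train/test approach has looked like (a minimal sketch; mydat, resp, and the predictors x1 and x2 are placeholder names):

set.seed(123)
# put ~70% of rows into a training set, the rest into a test set
train_idx <- sample(seq_len(nrow(mydat)), size = floor(0.7 * nrow(mydat)))
train_dat <- mydat[train_idx, ]
test_dat  <- mydat[-train_idx, ]

fit   <- lm(resp ~ x1 + x2, data = train_dat)      # candidate multiple regression model
preds <- predict(fit, newdata = test_dat)          # predictions on the held-out 30%
rmse  <- sqrt(mean((test_dat$resp - preds)^2))     # compare predictions with actual observations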
After determining the best model, would one then run it through a package like caret:
library(caret)

# set up 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
model <- train(resp ~ (all variables included in final model), data = mydat,
               trControl = train_control,
               method = "rpart")  # "rpart" fits a decision tree; "lm" would fit a linear model
and then just obtain an RMSE as usual, by comparing model$pred with the actual values?
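In other words, is this roughly the idea? (A sketch under my assumptions: I'm guessing that savePredictions is how caret keeps the fold-level held-out predictions, and using method = "lm" since I'm doing multiple regression.)

train_control <- trainControl(method = "cv", number = 10, savePredictions = "final")
model <- train(resp ~ (all variables included in final model), data = mydat,
               trControl = train_control, method = "lm")

model$results                                     # caret's RMSE/Rsquared averaged across folds
head(model$pred)                                  # held-out predictions and observations per fold
sqrt(mean((model$pred$obs - model$pred$pred)^2))  # RMSE computed from those held-out predictions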
In this particular instance, I have a final sample of 350 participants from the 2018 Camp Fire wildfire disaster in California. I'm looking at factors like emotional support, family support, services, etc. (roughly 12 variables of interest, ~6 of which remain if I use all of the data) as predictors of well-being. Things like ethnicity, gender, and group (5, 2, and 2 levels, respectively) are included in preliminary models, but none explains much variance. I've also looked at some interactions (marriage by SES), but those also don't show much significance. There is only one time point.
With regard to a parsimonious model being "the best," my understanding is simply that, by testing model significance via LRTs, one tries to strike a balance between bias and variance as variables are added, and is therefore less likely to over- or under-fit the model. In that way, I believe one can also best balance an explanatory model against a predictive one. I'm certainly not steadfast in that; it's only what I've gathered from experience.
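To be clear about what I mean by LRT-based comparison, this is the sort of thing I've been doing (a sketch; the variable names are placeholders based on my predictors):

fit_small <- lm(resp ~ emotional_support + family_support, data = mydat)
fit_large <- lm(resp ~ emotional_support + family_support + services, data = mydat)

# compare nested models: anova() gives the F-test version,
# lmtest::lrtest(fit_small, fit_large) gives the chi-squared likelihood ratio test
anova(fit_small, fit_large)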
Thanks much!