I'm currently building an XGBoost model to predict sales for a certain line of products. I'm using caret's train() function with 10-fold cross-validation to fine-tune the model's hyperparameters. The issue I currently face is that I only have 24 data points to work with, so I'm experiencing variance issues.
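For context, here is a minimal sketch of my setup (the data frame `sales_df`, its column names, and the tuning grid values are placeholders, not my real data):

```r
library(caret)

set.seed(777)
ctrl <- trainControl(method = "cv", number = 10)  # 10-fold CV for tuning

fit <- train(
  sales ~ .,                     # predict sales from the candidate features
  data      = sales_df,          # my 24-row training set (placeholder name)
  method    = "xgbTree",
  trControl = ctrl,
  metric    = "RMSE",
  tuneGrid  = expand.grid(       # small illustrative grid, not my actual values
    nrounds          = c(50, 100),
    max_depth        = c(2, 3),
    eta              = c(0.05, 0.1),
    gamma            = 0,
    colsample_bytree = 0.8,
    min_child_weight = 1,
    subsample        = 0.8
  )
)
```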
At the moment, I'm trying to determine which features I should add to the model, and I don't know how to go about it, because each candidate model's performance on the test set varies with the random seed. For example, with a seed of 777 one specific model has a test RMSE of 140, but with a different seed the same model has a test RMSE of 400.
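This is roughly how I'm observing the seed-to-seed swings (again a sketch; the seeds, split proportion, and `sales_df` are illustrative):

```r
# Refit and re-evaluate the same candidate model under different seeds
rmses <- sapply(c(777, 123, 2024), function(s) {
  set.seed(s)
  idx   <- createDataPartition(sales_df$sales, p = 0.75, list = FALSE)
  fit_s <- train(sales ~ ., data = sales_df[idx, ], method = "xgbTree",
                 trControl = trainControl(method = "cv", number = 10))
  preds <- predict(fit_s, newdata = sales_df[-idx, ])
  RMSE(preds, sales_df$sales[-idx])   # held-out test RMSE for this seed
})
rmses   # swings widely from seed to seed, e.g. ~140 up to ~400
```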
Exactly how should I go about selecting the best model? My idea was to pick the model with the lowest training RMSE from the 10-fold cross-validation. Any ideas?