
So, for the purpose of my master's thesis I'm trying to predict profitability on time series data using elastic net and XGBoost. I split the data 80/20 (50k instances, 3k+ features). I do not use cross-validation (I tried it, but the system crashed every time) or a validation set. I train and tune the model on the training set and evaluate its performance on the test set.
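Roughly what that looks like, as a sketch (here with scikit-learn's ElasticNet and synthetic stand-in data; my real matrix is 50k × 3k+):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))          # synthetic stand-in for the real data
y = X[:, 0] + rng.normal(size=1000)

# 80/20 split; shuffle=False keeps the chronological order of the time series
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model = ElasticNet(alpha=0.1, l1_ratio=0.5)  # values tuned on the training set
model.fit(X_train, y_train)
print("Test R^2:", r2_score(y_test, model.predict(X_test)))
```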

As I have quite a lot of data, I was wondering: would my simple technique be appropriate, or should I use an extra validation set for hyperparameter tuning? I would really appreciate any helpful answers or pointers to relevant papers.

ljourney

2 Answers


It is generally better practice to use cross-validation (e.g. 10-fold CV) than just a single random split of your data. It would be even better if you could use CV for tuning and then test your model's performance on a completely independent hold-out set. You have enough instances to do the latter.
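For illustration, a minimal sketch of that scheme, assuming scikit-learn; since your data is a time series, forward-chaining folds (TimeSeriesSplit) are used instead of random ones, and the data below is a synthetic stand-in:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 100))                  # synthetic stand-in data
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=2000)

# Hold out a final test set that is never touched during tuning
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid={"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]},
    cv=TimeSeriesSplit(n_splits=5),               # CV runs on the development set only
    scoring="neg_mean_squared_error",
)
search.fit(X_dev, y_dev)

print("Best params:", search.best_params_)
print("Test score (neg MSE):", search.score(X_test, y_test))
```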

Hope this helps.

Mati

Gradient boosting has many hyperparameters, and with 50k instances I would suggest using a separate validation set for hyperparameter tuning.
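A minimal sketch of such a three-way split, assuming the xgboost Python package and synthetic stand-in data; the validation set drives early stopping, and the test set is used exactly once at the end:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 100))                  # synthetic stand-in data
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=5000)

# Chronological 60/20/20 split to respect the time-series ordering
n = len(X)
i_tr, i_va = int(0.6 * n), int(0.8 * n)
dtrain = xgb.DMatrix(X[:i_tr], label=y[:i_tr])
dvalid = xgb.DMatrix(X[i_tr:i_va], label=y[i_tr:i_va])
dtest = xgb.DMatrix(X[i_va:], label=y[i_va:])

params = {"objective": "reg:squarederror", "eta": 0.1, "max_depth": 6}
booster = xgb.train(
    params, dtrain, num_boost_round=1000,
    evals=[(dvalid, "valid")],
    early_stopping_rounds=50,                     # tuned against the validation set
    verbose_eval=False,
)

preds = booster.predict(dtest, iteration_range=(0, booster.best_iteration + 1))
print("Test RMSE:", float(np.sqrt(np.mean((preds - y[i_va:]) ** 2))))
```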

"system crashed everytime" is not a valid excuse to not do something important. The solution is just "debug" and try other software packages (if you are using R, caret is a good one to try and it do almost everything for you).

Haitao Du