
So, for the purpose of my master's thesis I'm trying to predict profitability on time series data using elastic net and XGBoost. I split the data 80/20 (50k instances, 3k+ features). I do not use cross-validation (I tried it, but the system crashed every time) or a validation set. I train and tune the model on the training set and evaluate its performance on the test set.
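Roughly what that looks like, as a sketch (here with scikit-learn's ElasticNet and synthetic stand-in data; my real matrix is 50k × 3k+):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))          # synthetic stand-in for the real data
y = X[:, 0] + rng.normal(size=1000)

# 80/20 split; shuffle=False keeps the chronological order of the time series
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model = ElasticNet(alpha=0.1, l1_ratio=0.5)  # values tuned on the training set
model.fit(X_train, y_train)
print("Test R^2:", r2_score(y_test, model.predict(X_test)))
```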

As I have quite a lot of data, I was wondering: would my simple technique be appropriate, or should I use an extra validation set for hyperparameter tuning? I would really appreciate any helpful answers or pointers to relevant papers.

ljourney

2 Answers


It is generally better practice to use cross-validation (e.g. 10-fold CV) than just a single random split of your data. It would be even better if you could use CV for tuning and then test your model's performance on a completely independent hold-out set. You have enough instances to do the latter.
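For illustration, a minimal sketch of that scheme, assuming scikit-learn; since your data is a time series, forward-chaining folds (TimeSeriesSplit) are used instead of random ones, and the data below is a synthetic stand-in:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 100))                  # synthetic stand-in data
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=2000)

# Hold out a final test set that is never touched during tuning
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid={"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]},
    cv=TimeSeriesSplit(n_splits=5),               # CV runs on the development set only
    scoring="neg_mean_squared_error",
)
search.fit(X_dev, y_dev)

print("Best params:", search.best_params_)
print("Test score (neg MSE):", search.score(X_test, y_test))
```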

Hope this helps.

Mati

Gradient boosting has many hyperparameters, and with 50k instances I would suggest using a separate validation set for hyperparameter tuning.
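A minimal sketch of such a three-way split, assuming the xgboost Python package and synthetic stand-in data; the validation set drives early stopping, and the test set is used exactly once at the end:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 100))                  # synthetic stand-in data
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=5000)

# Chronological 60/20/20 split to respect the time-series ordering
n = len(X)
i_tr, i_va = int(0.6 * n), int(0.8 * n)
dtrain = xgb.DMatrix(X[:i_tr], label=y[:i_tr])
dvalid = xgb.DMatrix(X[i_tr:i_va], label=y[i_tr:i_va])
dtest = xgb.DMatrix(X[i_va:], label=y[i_va:])

params = {"objective": "reg:squarederror", "eta": 0.1, "max_depth": 6}
booster = xgb.train(
    params, dtrain, num_boost_round=1000,
    evals=[(dvalid, "valid")],
    early_stopping_rounds=50,                     # tuned against the validation set
    verbose_eval=False,
)

preds = booster.predict(dtest, iteration_range=(0, booster.best_iteration + 1))
print("Test RMSE:", float(np.sqrt(np.mean((preds - y[i_va:]) ** 2))))
```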

"system crashed everytime" is not a valid excuse to not do something important. The solution is just "debug" and try other software packages (if you are using R, caret is a good one to try and it do almost everything for you).

Haitao Du