I am new to Kaggle competitions and want to know if there are best practices for selecting a robust CV strategy.
1 Answer
A common approach that balances computational cost against the quality of the error estimate is to choose between 5 and 10 folds and to shuffle the data before splitting.
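A minimal sketch of that setup with scikit-learn's `KFold` (the fold count, seed, dataset, and model here are illustrative choices, not prescriptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy data standing in for a competition training set.
X, y = make_classification(n_samples=1000, random_state=0)

# Shuffle before splitting; 5 folds is a common cost/quality trade-off.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean gives you a feel for how stable the estimate is across folds, which is exactly what you want before trusting it over the leaderboard.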
Another thing you might want to check is how well represented your data is across folds. For example, you may want to use stratified sampling when making the splits if you need to keep the class ratio constant over the folds, as sketched below. With very imbalanced data, one option is to apply over/undersampling techniques inside the cross-validation loop as well.
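Here is one way that could look. `StratifiedKFold` is standard scikit-learn; the resampling step uses SMOTE from the third-party `imbalanced-learn` package (an assumption on my part, one of several resampling libraries you could use), whose pipeline applies resampling to the training folds only, so the test folds stay untouched:

```python
from imblearn.over_sampling import SMOTE          # pip install imbalanced-learn
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Stratified splits keep the class ratio constant across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Oversampling happens inside the CV loop, on training folds only,
# which avoids leaking synthetic points into the evaluation folds.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
print(f"CV F1: {scores.mean():.3f}")
```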
Finally, you must be extremely careful with time series data: you have to be sure not to include lagged information in the variables that would allow for "look-ahead" bias. A common option here is "roll forward" CV, where you don't shuffle the data and use a growing cutoff for the train/test split.
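scikit-learn's `TimeSeriesSplit` implements this roll-forward scheme: the data is never shuffled, and each training window ends before its test window begins, so the split itself cannot leak future information. (Leakage through lagged features still has to be checked when you construct them.) A quick sketch on a toy series:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # toy series, already in time order

# Each successive split trains on a growing prefix and tests on the
# block that immediately follows it.
cv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in cv.split(X):
    print(f"train: rows 0-{train_idx[-1]}, "
          f"test: rows {test_idx[0]}-{test_idx[-1]}")
```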

Very helpful answer, but I meant more towards Kaggle specifically. I would do my CV then post to the leaderboard... I'd much rather not need to post to the leaderboard after a certain point, because I'd know my CV is robust and I can evaluate internally. I hope that makes sense? – Kurtis Pykes Jan 07 '20 at 22:48