
I have a machine learning algorithm with some hyperparameters. First, I split the data into 70% (A-set) and 30% (B-set).

Then, I used 5-fold cross-validation on the A-set to find the best hyperparameters.

Finally, I used 10-fold cross-validation on all of the data to report the performance of the algorithm.
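
For concreteness, here is a rough sketch of what I did (not my exact code; it assumes scikit-learn, and the model, grid, and data are placeholders):

```python
# Sketch of the procedure described above (scikit-learn assumed;
# model, hyperparameter grid, and data are placeholders).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# Step 1: 70/30 split into A-set and B-set
X_A, X_B, y_A, y_B = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2: 5-fold CV on the A-set to pick hyperparameters
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
search.fit(X_A, y_A)
best_params = search.best_params_

# Step 3: 10-fold CV on ALL the data with the chosen hyperparameters
scores = cross_val_score(SVC(**best_params), X, y, cv=10)
print(scores.mean())
```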

Was my approach correct? If yes, is there any reference for it?

Is my approach biased?

Thanks in advance.

moha
  • Possible duplicate of [How to split the dataset for cross validation, learning curve, and final evaluation?](https://stats.stackexchange.com/questions/95797/how-to-split-the-dataset-for-cross-validation-learning-curve-and-final-evaluat) – jpmuc Jul 23 '19 at 09:12

1 Answer


This approach is not correct: the hyperparameters are tuned on data (the A-set) that is later reused for evaluation, so the reported performance will be optimistically biased. A better approach would be to evaluate only on set B, e.g. with 10-fold cross-validation restricted to B.
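
A minimal sketch of that fix, assuming scikit-learn (the model, grid, and data are placeholders): hyperparameters are chosen on the A-set only, and the reported 10-fold evaluation never touches anything used for tuning.

```python
# Sketch of the suggested fix (scikit-learn assumed; model, grid, and data
# are placeholders): tune on the A-set, report performance using only the B-set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# 70/30 split into A-set (tuning) and B-set (evaluation)
X_A, X_B, y_A, y_B = train_test_split(X, y, test_size=0.3, random_state=0)

# Hyperparameters are chosen with 5-fold CV on the A-set only
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
search.fit(X_A, y_A)

# Performance is reported with 10-fold CV restricted to the B-set,
# which was never used for tuning
scores = cross_val_score(SVC(**search.best_params_), X_B, y_B, cv=10)
print(scores.mean(), scores.std())
```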

sjw
  • Or, instead of a single split into A and B sets, go for nested cross-validation, i.e. wrap the 10-fold CV around the optimized model training with the 5-fold CV inside. – cbeleites unhappy with SX Jul 22 '19 at 10:38
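
A minimal sketch of the nested cross-validation suggested in the comment above, assuming scikit-learn (model, grid, and data are illustrative placeholders): an inner 5-fold grid search is wrapped inside an outer 10-fold CV used for performance estimation.

```python
# Nested CV sketch: outer 10-fold CV for performance estimation,
# inner 5-fold GridSearchCV for hyperparameter tuning (scikit-learn assumed;
# model, grid, and data are illustrative placeholders).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Inner loop: 5-fold CV over the hyperparameter grid
inner_search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)

# Outer loop: each of the 10 folds re-runs the full tuning procedure on its
# training part, so the outer test folds are never seen during tuning
outer_scores = cross_val_score(inner_search, X, y, cv=10)
print(outer_scores.mean(), outer_scores.std())
```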