
Normally in machine learning we split our data into train, validation, and test sets. The validation data is used to tune the hyperparameters, and the test data is then used to check the performance of our best tuned model (watching out for notably different results on the validation and test data).

H2O's AutoML (similar to auto-sklearn, i.e. designed to automate both finding the best algorithm and tuning it) offers a `leaderboard_frame`, and it appears to play the same role as the test data: it is used neither for training nor for model tuning, but only to measure model performance.

So, should I give my test split as `leaderboard_frame`, should I start splitting my data 4-ways, or should I not use `leaderboard_frame` at all and instead take AutoML's best model and evaluate it myself on the test set? If I do start passing my test data as `leaderboard_frame`, are there any extra precautions I should take? Any best practices?
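For reference, a rough sketch of the 4-way option using the Python client (the file path, response column, and split ratios below are just placeholders):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

df = h2o.import_file("my_data.csv")   # placeholder dataset
y = "target"                          # placeholder response column
x = [c for c in df.columns if c != y]

# 4-way split: train / valid / leaderboard / test
train, valid, lb, test = df.split_frame(ratios=[0.6, 0.15, 0.15], seed=1)

aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y,
          training_frame=train,
          validation_frame=valid,
          leaderboard_frame=lb)
# `test` stays untouched for a final check of whichever model I pick
```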

Darren Cook

1 Answer


According to the docs:

leaderboard_frame: This argument allows the user to specify a particular data frame to rank the models on the leaderboard. This frame will not be used for anything besides creating the leaderboard. If this option is not specified, then a leaderboard_frame will be created from the training_frame.

This means that the `validation_frame` is used to tune the hyperparameters of the individual models, and the `leaderboard_frame` is then used to choose the winning tuned model. This choice makes the winning model's score on the `leaderboard_frame` optimistically biased, for the same reason that the validation-set performance of a tuned model is always optimistically biased: you cannot use the same data set both to make modeling choices and to estimate the hold-out error.

So, if you would like an unbiased estimate of the final, tuned model's performance, you still need a test set held out from the whole process.
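A minimal sketch of that workflow with the Python client, assuming `aml` is a finished AutoML run and `test` is a frame held out from training, validation, and the leaderboard:

```python
# Rank and pick a model using the leaderboard as usual
print(aml.leaderboard)

# Then estimate generalization performance on data AutoML never saw
perf = aml.leader.model_performance(test_data=test)
print(perf)
```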

Matthew Drury
  • Thanks, good points. If the leaderboard was ordered based on the results of the validation frame, but showed the results on the leaderboard_frame as a separate column, that avoids it being optimistically biased, doesn't it? – Darren Cook Aug 04 '17 at 17:01
  • @DarrenCook The `validation_frame` is used to tune the individual models via early stopping methods, so it's not appropriate to use the same validation frame to score the individual models. That's why we use a separate holdout set, `leaderboard_frame`. – Erin LeDell Aug 08 '17 at 03:09
  • The reason we decided to use the term `leaderboard_frame` instead of `test_frame` for that holdout set is because we assumed that people would choose a model based on the leaderboard rankings. Matthew is exactly right that if you want to get an honest, unbiased estimate of the final, tuned model, then another test set is required. That said, the leaderboard score is a fairly accurate (though slightly biased) estimate of performance. – Erin LeDell Aug 08 '17 at 03:15
  • @ErinLeDell If I don't use AutoML, I will be choosing a model (and ranking my models) based on their performance on the validation set. I'll then be testing just the single best model on the test set. – Darren Cook Aug 08 '17 at 08:54
  • @DarrenCook AutoML will always pass a `validation_frame` to the algos (for early stopping), so if you want to be completely unbiased, you should not use the same `validation_frame` to rank your models & choose the best one. However, there is nothing stopping you from passing in the same hold-out set to the `validation_frame` and `leaderboard_frame` arguments if you still want to use the AutoML framework to do what you're doing by hand. – Erin LeDell Aug 08 '17 at 17:32