
Many people suggest fine-tuning a network using Bayesian optimization (or grid search, or whatever other black-box optimization method you like), so I tried it for myself. I am not sure about the following things:

  1. How long should I run the network at each iteration of the Bayesian optimization? I chose to run it for about a tenth of the number of epochs it would take to train the network fully.
  2. What should be the term that I optimize? I chose the minimum validation loss during the short training. Should I instead fit the loss to some exponential decay function and estimate the loss at the end of training, assuming the learning curve stays smooth?
  3. How many iterations should I run the Bayesian optimization, given that I have about 15 hyperparameters to tune (most of which are continuous over a small range)?

Any other advice would be much appreciated as well. Thanks, Dan


1 Answer


How long should I run the network at each iteration of the Bayesian optimization? I chose to run it for about a tenth of the number of epochs it would take to train the network fully.

Each iteration of BO should do whatever you do when you have a final parameter configuration. If you use early stopping, use early stopping. If you're training for a fixed number of epochs, do that.
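For concreteness, here is a minimal sketch of that setup using scikit-optimize and Keras (both my own assumptions, not something the question specifies; the toy data, network, and the two tuned hyperparameters are placeholders). The only point is that the objective trains each candidate exactly the way a final run would, early stopping included, and returns its validation loss:

```python
# Sketch: each BO trial uses the same training procedure as a final run,
# including early stopping on validation loss. Data and model are toy
# placeholders; swap in your own.
import numpy as np
from skopt import gp_minimize
from skopt.space import Real
from tensorflow import keras

rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)
x_val, y_val = rng.normal(size=(100, 20)), rng.integers(0, 2, 100)

space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
    Real(0.0, 0.5, name="dropout"),
]

def objective(params):
    learning_rate, dropout = params
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy")
    # Same stopping rule you would use when training the final model.
    stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                         restore_best_weights=True)
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=100, batch_size=32, callbacks=[stop], verbose=0)
    return min(history.history["val_loss"])

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best params:", result.x, "best val loss:", result.fun)
```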

What should be the term that I optimize? I chose the minimum validation loss during the short training. Should I instead fit the loss to some exponential decay function and estimate the loss at the end of training, assuming the learning curve stays smooth?

I don't know what this means.

How many iterations should I run the Bayesian optimization, given that I have about 15 hyperparameters to tune (most of which are continuous over a small range)?

What's your budget, in terms of computational resources or time? What's your tolerance for ending up with a sub-optimal network? I can't answer these questions for you. That being said, as a basis of comparison, pure random sampling has a nice probability guarantee: with 60 independent uniform draws there is about a 95% chance that at least one lands in the top 5% of the search space. So you might use 60 iterations as a baseline for comparison.
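A quick arithmetic check of where that 60 comes from (a standalone calculation, not part of the tuning code):

```python
# Probability that at least one of n independent uniform random configurations
# falls in the top 5% of the search space: 1 - 0.95**n.
n = 60
print(1 - 0.95 ** n)  # ~0.954
```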

Sycorax
  • Thank you - great answer. What I meant by the second point is: instead of running the full training, run some portion of it and estimate the result at the end of the full training - I think this can be done since the learning curve is quite smooth in most cases. – Dan Erez Jun 18 '18 at 08:09
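As a rough illustration of the idea in that comment (my own sketch, not part of the answer): fit an exponential decay to a partial validation-loss history with SciPy and read off the fitted asymptote as an estimate of the fully-trained loss. The loss values below are synthetic placeholders.

```python
# Fit a * exp(-b * t) + c to a short validation-loss history; the fitted c is
# an estimate of the loss the curve would approach if training continued.
import numpy as np
from scipy.optimize import curve_fit

epochs = np.arange(1, 11)  # e.g. only a tenth of the full training run
rng = np.random.default_rng(0)
val_loss = 0.30 + 0.70 * np.exp(-0.25 * epochs) + rng.normal(0, 0.01, size=10)

def decay(t, a, b, c):
    return a * np.exp(-b * t) + c

(a, b, c), _ = curve_fit(decay, epochs, val_loss, p0=(1.0, 0.1, 0.1))
print("estimated end-of-training loss:", c)
```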