
Many people suggest fine-tuning a network using Bayesian optimization (or grid search, or whatever other black-box optimization method you like), so I tried it for myself. I am not sure about the following things:

  1. How long should I run the network at each iteration of the Bayesian optimization? I chose to run it for about a tenth of the number of epochs it would take to train the network fully.
  2. What should be the term that I optimize? I chose the minimum validation loss during the short training. Should I instead fit the loss to some exponential decay function and estimate the loss at the end of training, assuming the learning curve stays smooth?
  3. How many iterations should I run the Bayesian optimization, given that I have about 15 hyperparameters to tune (most of which are continuous over a small range)?

Any other advice would be much appreciated as well. Thanks, Dan


1 Answer


How long should I run the network at each iteration of the Bayesian optimization? I chose to run it for about a tenth of the number of epochs it would take to train the network fully.

Each iteration of BO should do whatever you do when you have a final parameter configuration. If you use early stopping, use early stopping. If you're training for a fixed number of epochs, do that.
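For concreteness, here is a minimal sketch of that setup using scikit-optimize and Keras (both my own assumptions, not something the question specifies; the toy data, network, and the two tuned hyperparameters are placeholders). The only point is that the objective trains each candidate exactly the way a final run would, early stopping included, and returns its validation loss:

```python
# Sketch: each BO trial uses the same training procedure as a final run,
# including early stopping on validation loss. Data and model are toy
# placeholders; swap in your own.
import numpy as np
from skopt import gp_minimize
from skopt.space import Real
from tensorflow import keras

rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)
x_val, y_val = rng.normal(size=(100, 20)), rng.integers(0, 2, 100)

space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
    Real(0.0, 0.5, name="dropout"),
]

def objective(params):
    learning_rate, dropout = params
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy")
    # Same stopping rule you would use when training the final model.
    stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                         restore_best_weights=True)
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=100, batch_size=32, callbacks=[stop], verbose=0)
    return min(history.history["val_loss"])

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best params:", result.x, "best val loss:", result.fun)
```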

What should be the term that I optimize? I chose the minimum validation loss during the short training. Should I instead fit the loss to some exponential decay function and estimate the loss at the end of training, assuming the learning curve stays smooth?

I don't know what this means.

How many iterations should I run the Bayesian optimization, given that I have about 15 hyperparameters to tune (most of which are continuous over a small range)?

What's your budget, in terms of computational resources or time? What's your tolerance for ending up with a sub-optimal network? I can't answer these questions for you. That being said, as a basis of comparison, pure random sampling has a nice probability guarantee: with 60 independent uniform draws there is about a 95% chance that at least one lands in the top 5% of the search space. So you might use 60 iterations as a baseline for comparison.
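A quick arithmetic check of where that 60 comes from (a standalone calculation, not part of the tuning code):

```python
# Probability that at least one of n independent uniform random configurations
# falls in the top 5% of the search space: 1 - 0.95**n.
n = 60
print(1 - 0.95 ** n)  # ~0.954
```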

Sycorax
  • Thank you - great answer. What I meant by the second point is: instead of running the full training, run some portion of it and estimate the result at the end of the full training - I think this can be done since the learning curve is quite smooth in most cases. – Dan Erez Jun 18 '18 at 08:09
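As a rough illustration of the idea in that comment (my own sketch, not part of the answer): fit an exponential decay to a partial validation-loss history with SciPy and read off the fitted asymptote as an estimate of the fully-trained loss. The loss values below are synthetic placeholders.

```python
# Fit a * exp(-b * t) + c to a short validation-loss history; the fitted c is
# an estimate of the loss the curve would approach if training continued.
import numpy as np
from scipy.optimize import curve_fit

epochs = np.arange(1, 11)  # e.g. only a tenth of the full training run
rng = np.random.default_rng(0)
val_loss = 0.30 + 0.70 * np.exp(-0.25 * epochs) + rng.normal(0, 0.01, size=10)

def decay(t, a, b, c):
    return a * np.exp(-b * t) + c

(a, b, c), _ = curve_fit(decay, epochs, val_loss, p0=(1.0, 0.1, 0.1))
print("estimated end-of-training loss:", c)
```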