
Where would one start when trying to figure out which distributions to use for hyperparameter tuning?

Libraries such as HyperOpt, Optuna, and sklearn (random search) ask not for uniformly distributed ranges, but for different probability distributions. I understand that what probability distribution one ends up using depends on the problem at hand and the algorithms used, but where does one start when trying to figure this out?

So far I can't find any tutorials on this, so any help would be appreciated.
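To make the question concrete, here is the kind of thing I mean (a minimal sketch with made-up parameter names and ranges, using sklearn's RandomizedSearchCV, which accepts scipy.stats distributions):

```python
# Minimal sketch: sklearn's RandomizedSearchCV takes scipy.stats
# distributions, not just uniform ranges. Parameter names and ranges
# here are made up for illustration.
from scipy.stats import loguniform, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "learning_rate": loguniform(1e-4, 1e-1),  # log-uniform over [1e-4, 0.1]
    "subsample": uniform(0.5, 0.5),           # uniform over [0.5, 1.0]
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions,
    n_iter=50,
)
# search.fit(X, y) would then sample 50 configurations from these distributions.
```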

ectoplasm

1 Answer


You are correct: there is no single "agreed" way of doing this. The problem is quite similar to picking priors for a Bayesian model. Start with your prior knowledge about the possible values of the parameters:

  • Did your previous experiments suggest reasonable values for the parameters? Those values should have a higher probability under the distribution. You should also consider the range of possible values: can you pin down the minimum and maximum? If not, perhaps you can say something like "there's a 95% chance that the value lies within the [a, b] interval"; in that case, 95% of the probability mass of the distribution should cover that region (a concrete sketch of this interval-to-prior mapping follows the list).
  • Maybe you can find papers describing what values of the hyperparameters worked well? Give them extra weight based on how similar their experimental setup was to yours: the more similar, the higher the probability you can assign to those values.
  • You can ask experts or your colleagues and use their answers to come up with the distribution. For example, if many people say that the parameter should be close to $x$, then $x$ should probably be the mode of the distribution, etc.
  • In general, think of the distribution in terms of subjective probability: ranges of values that you have reason to believe are better should have higher probability under the distribution.
  • You can also just use uniform distributions; it's simply that a distribution that concentrates relatively more probability mass on reasonable values makes them get tried more often, so the optimization is more efficient if you start with reasonable guesses (the log-uniform distribution in the second sketch below is a common ready-made example).
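For example, here is a minimal sketch of the interval-to-prior mapping from the first point (the parameter name and interval are made up, and I'm assuming a normal prior with HyperOpt): for a normal distribution, about 95% of the mass lies within ±1.96 standard deviations of the mean, so setting mu = (a + b)/2 and sigma = (b − a)/(2 · 1.96) puts roughly 95% of the mass on [a, b].

```python
# Sketch: turning a subjective "95% chance the value is in [a, b]" belief
# into a normal prior for HyperOpt. The parameter name and interval are
# made up for illustration.
from hyperopt import fmin, hp, tpe

a, b = 0.01, 0.3
mu = (a + b) / 2              # centre the prior on the interval midpoint
sigma = (b - a) / (2 * 1.96)  # +/- 1.96 sigma covers ~95% of a normal

# Note: hp.normal is unbounded, so values outside [a, b] (even negative
# ones) can still be drawn; clip inside the objective if that matters.
space = {"dropout": hp.normal("dropout", mu, sigma)}

def objective(params):
    # Placeholder objective: substitute your validation loss here.
    return (params["dropout"] - 0.1) ** 2

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100)
```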
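And for the last point, a common ready-made compromise (a sketch using Optuna; the parameter name, range, and objective are made up): a log-uniform distribution concentrates relatively more mass toward the low end of the range, which is often a sensible default for scale-type parameters such as learning rates.

```python
# Sketch: a log-uniform search distribution in Optuna. It puts more
# probability mass near the low end of the range, a common default
# prior for scale parameters like learning rates.
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)  # log-uniform
    # Placeholder objective: substitute your validation loss here.
    return (lr - 1e-3) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
```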
Tim
  • I see. It's more complicated than I thought. So far I've been using uniform distributions and then visualising all the hyperparameter values to restrict my searches. I can probably start using particular probability distributions based on those visualisations. – ectoplasm May 31 '21 at 15:47
  • @ectoplasm running Bayesian optimization with uniform distributions to find parameter ranges, and then running it again over those narrower ranges, is not the best idea. The algorithm would do this by itself, likely better than you, so instead just use more optimization steps. – Tim May 31 '21 at 15:59
  • I meant it might give me some idea of what probability distribution should be used. However, now that I think about it, I don't see how looking at visualisations will help me with that. – ectoplasm May 31 '21 at 16:53