
Random search has a 95% probability of finding a combination of parameters within the 5% optima with only 60 iterations. Also, compared to other methods, it doesn't get bogged down in local optima.

This seems to be true (I have not experimented much yet), but I have a question: is this claim always true, or does it change when the dimensionality of the search space changes? Are there different formulae that take the number of dimensions into account?

Sycorax
JoseLuis72

1 Answer


The proof of your quotation is very straightforward, and the result does not depend on the dimension of the search space.

Random search is used in a scenario where we don't know much about the function $f$ we seek to optimize. The function is allowed to be very general. In particular, we admit non-convex $f$ and functions $f$ for which the gradient is unavailable. However, we do know two things which allow us to make progress in optimizing $f$:

  1. We define a set $\mathcal{X}$ that we're searching within. Typically, this set is just some rectangle which we guess contains the optimal point somewhere within it. (Of course, if the cardinality of $\mathcal{X}$ is small, we can just try all $x\in\mathcal{X}$ and we'll know for certain the best value. But if $\mathcal{X}\subset\mathbb{R}^d$, it's impossible to enumerate all $x$, so we'll need a different approach.)
  2. We can query the function. That is, for some $x \in \mathcal{X}$, we can obtain the value $f(x)$.
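
To make this setup concrete, here is a minimal random-search sketch in Python. The objective `f` and the rectangle $[0,1]^d$ are stand-ins invented for illustration; in practice $f$ would be, say, a model's validation score as a function of its hyper-parameter tuple:

```python
import random

# A toy objective; in practice f is a black box (e.g. a model's
# validation score), possibly non-convex and gradient-free.
def f(x):
    return -sum((xi - 0.3) ** 2 for xi in x)

d, n = 4, 60          # dimension of X and number of queries (both arbitrary here)

best_x, best_val = None, float("-inf")
for _ in range(n):
    x = [random.uniform(0.0, 1.0) for _ in range(d)]  # uniform draw from X = [0, 1]^d
    val = f(x)                                        # query the function
    if val > best_val:
        best_x, best_val = x, val
```

Note that nothing in the loop looks at $d$ except the sampling step itself; the analysis below is about how many queries $n$ we need, not about the dimension.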

Now, it's reasonable to ask "If I make $n$ queries of $f$ using $n$ randomly selected points in $\mathcal{X}$, what's the probability $p$ that one or more of my queries is among the largest $100(1-q)$ percent of all function values $f(x)$ such that $x\in\mathcal{X}$?" We can answer this definitively using some results from probability theory.

So we choose the probability $p$ with which we want at least one query to land in the top quantile $q$, and we can show that $n \approx 60$ suffices, as claimed in the quotation. This result does not depend on the dimension of $\mathcal{X}$.

Suppose your quantile is $q = 0.95$ and you want a $p=0.95$ probability that at least one result is in the top $100(1-q)=5$ percent of all hyper-parameter tuples. The probability that all $n$ attempted tuples miss that window is $q^n = 0.95^n$ (because they are chosen independently at random from the same distribution), so the probability that at least one tuple lands in that region is $1 - 0.95^n$. Putting it all together, we have

$$ 1 - q^n \ge p \implies n \ge \frac{\log(1 - p)}{\log(q)} $$

which in our specific case yields $n \ge \log(0.05)/\log(0.95) \approx 58.4$, so $n = 59$ queries suffice; your author rounded up to 60.
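
The bound is easy to evaluate numerically. A small helper (the name `n_required` is mine, not from any library) computes the smallest such $n$ for any choice of $p$ and $q$:

```python
import math

def n_required(p=0.95, q=0.95):
    """Smallest integer n satisfying 1 - q**n >= p,
    i.e. n >= log(1 - p) / log(q)."""
    return math.ceil(math.log(1 - p) / math.log(q))
```

For the defaults $p = q = 0.95$ this returns 59, matching the derivation; demanding $p = 0.99$ instead pushes the requirement up to 90 queries.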


We can get some intuition about how random search works if we consider a special dartboard. I divide the dartboard into 100 equal-sized wedges and label each wedge with a distinct integer. Because each label is unique, exactly one wedge carries the largest number, and the wedges with the five largest numbers are the top 5% by definition.

In this game, I'm throwing darts at a dartboard in a dark room, so I can't see the board. This means that when I throw a dart at the dartboard, it will land uniformly at random on the board. This is much like random search: we know there is a wedge labeled with the largest value somewhere on the board, but we can't see where it is because we're in the dark.

Now let's consider several games.

  • I throw $n=1$ dart at the board. The probability that I hit a wedge among the largest 5% of wedges is $\frac{5}{100}=1-0.95^1=0.05$.

  • I throw $n=20$ darts. The probability that one or more darts hits a wedge among the largest 5% of wedges is one minus the probability that all darts miss that region: $1-0.95^{20}\approx 0.64151.$

  • I throw $n=58$ darts. Again, the probability that at least one dart lands in the top 5% is one minus the probability that all darts miss: $1-0.95^{58}\approx 0.9489531.$

  • I throw $n=59$ darts. The probability that at least one dart lands in the top 5% is $1-0.95^{59}\approx 0.9515.$ This is the smallest $n$ for which the probability exceeds $0.95$.

  • I throw $n=60$ darts. The probability that at least one dart lands in the top 5% is $1-0.95^{60}\approx 0.95393.$
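
If you'd rather not trust the algebra, the dart game is easy to simulate. This sketch (the function name `p_hit_top` is my own) estimates the hit probability by Monte Carlo:

```python
import random

def p_hit_top(n_darts, trials=100_000, wedges=100, top=5, seed=0):
    """Estimate P(at least one of n_darts lands in one of the `top`
    best wedges) by simulating `trials` games in the dark room."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each dart lands uniformly at random; wedges 0..top-1 are the best ones.
        if any(rng.randrange(wedges) < top for _ in range(n_darts)):
            hits += 1
    return hits / trials
```

`p_hit_top(60)` should come out close to the analytic value $1-0.95^{60}\approx 0.954$, and `p_hit_top(1)` close to $0.05$.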

Sycorax
  • What does quantile mean in this context? I understand the point of having a 95% probability (confidence) of finding at least one optimal tuple in the 5% (optimal) region. But I do not understand what quantile stands for. Wouldn't it be easier to just know the optimal region, and not a quantile (q), and then search that region? Because I understand that you want a 95% chance of getting the result from the 5%, not the other 95%. I don't get the "[...] the probability p that you want your results to lie in some quantile q [...]". – JoseLuis72 Nov 13 '20 at 07:04
  • Quantile has the usual meaning here; you could multiply by 100 and replace "quantile" with "percentile" and the meaning wouldn't change. You're correct, it would be much easier **if** we knew the best region ahead of time. In this scenario, one would just search *that* region and have 100% probability of discovering points in the optimal region *by construction*. But random search is designed to use in cases when we don't know anything about the optimal region, such as hyper-parameter search. – Sycorax Nov 13 '20 at 13:29
  • @Sycorax I think this makes no sense at all. You say we don't know what the optimal region is; however, you calculate that region with ***q***. In a sense, you *must* know it. After all, it doesn't matter if you have a *quantile* or an *optimal region* variable, as you'll have to calculate the region anyway. What is quantile then: the % of the region you want to explore? Or the region you think the optimal values lie in? If you don't explain it properly, it's very difficult to get the point. Please, what does quantile stand for (simply explained)? – player1 Nov 13 '20 at 15:40
  • I believe my explanation of "quantile" in terms of percentiles is very straightforward. Whenever I find a word that I don't understand, I use a dictionary or another high-quality resource to look up the definition. If you've done this and still have a question about what a quantile is, you can use the Ask Question button at the top of the screen. I don't think that the claims you're making pertain to the usage of random search for hyper-parameter optimization, so this comment seems to be reasoning from a faulty premise. See my edit; I don't think I'll be able to help you further. – Sycorax Nov 13 '20 at 16:13
  • With throwing darts in the dark, the major problem would be to hit the board at all. This extends to a great degree to using random search for hyperparameter tuning, since you are using it at random and you have no clue if you got even close to the board. – Tim Nov 13 '20 at 20:39
  • @Tim Indeed! And one way to deal with that problem is by making the dartboard even larger (choosing $\mathcal{X}$ with a larger cardinality or volume). Of course, if this larger dartboard also has a *lower value of its lower bound* among the largest $100(1-q)$ percent, then it's not clear that the resulting model is any better (for fixed choice of $p,q,n$), because the top percent might not be good enough for your purposes. – Sycorax Nov 13 '20 at 20:41