Background
This problem goes by several names, among them "budgeted multi-armed bandit problem", "pure exploration in finitely-armed bandits", and "best arm identification in multi-armed bandits".
This problem concerns the allocation of trials during an initial phase of the solution (the evaluation/exploration phase), in which no gains are made from the trials; after this phase, the learner (forecaster, predictor, decision-maker) must make a single decision and choose the distribution she evaluates as having the highest mean.
The canonical example of this problem contrasts the pharmaceutical and beauty industries. Suppose there are 10 products of different qualities, from which a company needs to choose a single one to commercialize. The company has a budget for providing 1,000 participants with its products, and needs to decide how to allocate the products to subjects so as to maximize the probability that, by the end of the experimentation, the best product is identified.
If these are 10 drugs, the pharmaceutical company cares about the participants' outcomes during the experimentation phase - i.e. the results of the experiments matter in their own right (and not only for making the final decision); that is, both exploration and exploitation should be optimized. In contrast, if the experimentation is on 10 hand lotions, the company may decide that participants benefit from the experiment in any case (all hand lotions are good), so the participants' results have no implications other than helping the company choose the best product; that is, pure exploration.
As suggested by @combo (in the linked paper), bounds on the probability of making the wrong choice can be proved for specific allocation methods (such as UCB). However, many of these methods either require tuning meta-parameters that presuppose knowledge of the distributions (namely, the gap between the means of the best and second-best distributions), or state how many trials are needed to guarantee choosing an $\epsilon$-best distribution with probability $1-\delta$, rather than how to allocate a predefined number of trials across a given number of distributions. The following paper, however, does address this problem directly, without any prior knowledge or assumptions about the data: Audibert, Jean-Yves, and Sébastien Bubeck. "Best arm identification in multi-armed bandits." COLT 2010 - 23rd Conference on Learning Theory, 2010.
The answer
Successive Rejects (see the Python sketch below the list):

1. current_distribution_set = all_distributions
2. Batch-sample from every distribution in current_distribution_set and estimate their means based on all past data.
3. Remove the distribution with the worst empirical mean from current_distribution_set.
4. Go back to step 2 until current_distribution_set consists of a single distribution, which is declared the winner.
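To make the procedure concrete, here is a minimal Python sketch, assuming the paper's phase lengths (the formula is restated below) and simple Gaussian arms for the simulation; the function names and the simulation setup are my own illustration, not from the paper:

```python
# Minimal sketch of Successive Rejects (Audibert & Bubeck, 2010).
# The Gaussian arms and all names here are illustrative assumptions.
import math
import numpy as np

def phase_lengths(n, K):
    """Cumulative sample counts n_0, ..., n_{K-1} from the paper."""
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    return [0] + [math.ceil((n - K) / (log_bar * (K + 1 - k)))
                  for k in range(1, K)]

def successive_rejects(arms, n, rng):
    """arms: list of callables, each returning one sample; n: total budget."""
    K = len(arms)
    n_k = phase_lengths(n, K)
    active = list(range(K))
    sums = np.zeros(K)    # running sum of rewards per arm
    counts = np.zeros(K)  # number of pulls per arm
    for k in range(1, K):
        pulls = n_k[k] - n_k[k - 1]  # extra pulls per surviving arm this phase
        for arm in active:
            for _ in range(pulls):
                sums[arm] += arms[arm](rng)
                counts[arm] += 1
        # Discard the arm with the lowest empirical mean.
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.remove(worst)
    return active[0]  # the single remaining arm is declared the winner

# Example: 10 arms with means 0.1, ..., 1.0 and a budget of 1,000 trials.
rng = np.random.default_rng(0)
arms = [lambda rng, m=m: rng.normal(m, 1.0) for m in np.linspace(0.1, 1.0, 10)]
print(successive_rejects(arms, 1000, rng))
```

In this example, the first phase allocates 41 pulls to each of the 10 arms, and later phases concentrate the remaining budget on fewer and fewer survivors.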
The paper shows that the probability of choosing wrongly (i.e. not the best distribution) shrinks exponentially with the number of trials.
The quality of the procedure hinges on choosing the batch sizes correctly.
The algorithm and batch sizes are given in the paper; with a budget of $n$ trials and $K$ distributions, the phase lengths are:
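$$\overline{\log}(K) = \frac{1}{2} + \sum_{i=2}^{K} \frac{1}{i}, \qquad n_0 = 0, \qquad n_k = \left\lceil \frac{1}{\overline{\log}(K)} \cdot \frac{n-K}{K+1-k} \right\rceil \ \text{ for } k = 1, \ldots, K-1.$$

In phase $k$, each of the $K+1-k$ surviving distributions is sampled $n_k - n_{k-1}$ times; at the end of the phase, the one with the lowest empirical mean is rejected.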
