Background
This problem goes by several names, among them "budgeted multi-armed bandit problem", "pure exploration in finitely-armed bandits", and "best arm identification in multi-armed bandits".
This problem concerns the allocation of trials during an initial phase of the solution (the evaluation/exploration phase), in which no gains are made from the trials; after this phase, the learner (forecaster, predictor, decision-maker) must make a single decision and choose the distribution she evaluates as having the highest mean.
The canonical example of this problem contrasts the pharmaceutical and beauty industries. Suppose there are 10 products of different qualities, from which a company needs to choose a single one to commercialize. The company has a budget for providing 1,000 participants with its products, and needs to decide how to allocate the products to subjects so as to maximize the probability that, by the end of the experimentation, the best product is identified.
If these are 10 drugs, the pharmaceutical company cares about the participants' outcomes during the experimentation phase - i.e. the results of the experiments matter in their own right (and not only for making the final decision); that is, both exploration and exploitation should be optimized. In contrast, if the experimentation is on 10 hand lotions, the company may decide that participants benefit from the experiment in any case (all hand lotions are good), so the participants' results have no implications other than helping the company choose the best product; that is, pure exploration.
As suggested by @combo (in the linked paper), bounds on the probability of making the wrong choice can be proved for specific allocation methods (such as UCB). However, many of these methods either require tuning meta-parameters that presuppose knowledge of the distributions (namely, the gap between the means of the best and second-best distributions), or state how many trials are needed to guarantee choosing an $\epsilon$-best distribution with probability $1-\delta$, rather than how to allocate a predefined number of trials across a given number of distributions. The following paper, however, does address this problem directly, without any prior knowledge or assumptions about the data: Audibert, Jean-Yves, and Sébastien Bubeck. "Best arm identification in multi-armed bandits." COLT 2010 - 23rd Conference on Learning Theory, 2010.
The answer
Successive Rejects (see the Python sketch below the list):

1. current_distribution_set = all_distributions
2. Batch-sample from every distribution in current_distribution_set and estimate their means based on all past data.
3. Remove the distribution with the worst empirical mean from current_distribution_set.
4. Go back to step 2 until current_distribution_set consists of a single distribution, which is declared the winner.
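To make the procedure concrete, here is a minimal Python sketch, assuming the paper's phase lengths (the formula is restated below) and simple Gaussian arms for the simulation; the function names and the simulation setup are my own illustration, not from the paper:

```python
# Minimal sketch of Successive Rejects (Audibert & Bubeck, 2010).
# The Gaussian arms and all names here are illustrative assumptions.
import math
import numpy as np

def phase_lengths(n, K):
    """Cumulative sample counts n_0, ..., n_{K-1} from the paper."""
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    return [0] + [math.ceil((n - K) / (log_bar * (K + 1 - k)))
                  for k in range(1, K)]

def successive_rejects(arms, n, rng):
    """arms: list of callables, each returning one sample; n: total budget."""
    K = len(arms)
    n_k = phase_lengths(n, K)
    active = list(range(K))
    sums = np.zeros(K)    # running sum of rewards per arm
    counts = np.zeros(K)  # number of pulls per arm
    for k in range(1, K):
        pulls = n_k[k] - n_k[k - 1]  # extra pulls per surviving arm this phase
        for arm in active:
            for _ in range(pulls):
                sums[arm] += arms[arm](rng)
                counts[arm] += 1
        # Discard the arm with the lowest empirical mean.
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.remove(worst)
    return active[0]  # the single remaining arm is declared the winner

# Example: 10 arms with means 0.1, ..., 1.0 and a budget of 1,000 trials.
rng = np.random.default_rng(0)
arms = [lambda rng, m=m: rng.normal(m, 1.0) for m in np.linspace(0.1, 1.0, 10)]
print(successive_rejects(arms, 1000, rng))
```

In this example, the first phase allocates 41 pulls to each of the 10 arms, and later phases concentrate the remaining budget on fewer and fewer survivors.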
The paper shows that the probability of choosing wrongly (i.e. not the best distribution) shrinks exponentially with the number of trials.
The quality of the procedure hinges on choosing the batch sizes correctly.
The algorithm and batch sizes are given in the paper; with a budget of $n$ trials and $K$ distributions, the phase lengths are:
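$$\overline{\log}(K) = \frac{1}{2} + \sum_{i=2}^{K} \frac{1}{i}, \qquad n_0 = 0, \qquad n_k = \left\lceil \frac{1}{\overline{\log}(K)} \cdot \frac{n-K}{K+1-k} \right\rceil \ \text{ for } k = 1, \ldots, K-1.$$

In phase $k$, each of the $K+1-k$ surviving distributions is sampled $n_k - n_{k-1}$ times; at the end of the phase, the one with the lowest empirical mean is rejected.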
