Imagine we have a two-armed bandit with prior binary (success/failure) counts. How can we interpret that using a Beta distribution? Meaning: which arm is the best arm to choose based on the prior?
arm 1: 5 successes 7 fails
arm 2: 50 successes 75 fails
You should probably start by reading more on Thompson sampling, e.g. this, or this Medium post, or this paper by Russo et al.
In standard Thompson sampling for a multi-armed bandit with $K$ arms, you assume a Beta distribution for the probability of success $\theta_k$ of each $k$-th arm
$$ \theta_k \sim \mathsf{Beta}(\alpha_k, \beta_k) $$
The Thompson sampling algorithm first samples the $\theta_k$ probabilities independently for each arm, then picks the winning arm as the one with the highest sampled probability of success
$$ i = \arg\max_k \theta_k $$
Next, you play the $i$-th arm, collect the reward (or not), and use it to update the Beta distribution to obtain the posterior. In the next round, you repeat the same procedure.
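As a rough sketch of those steps (the true success probabilities and number of rounds below are made up purely to simulate rewards, and I assume a uniform $\mathsf{Beta}(1, 1)$ prior added to the counts from the question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta parameters per arm, starting from the counts in the question
# (alpha = successes + 1, beta = failures + 1 under a uniform Beta(1, 1) prior).
alpha = np.array([5 + 1, 50 + 1], dtype=float)
beta = np.array([7 + 1, 75 + 1], dtype=float)

# Hypothetical true success probabilities, used only to simulate rewards here.
true_p = np.array([0.45, 0.40])

for t in range(1000):
    # 1. Sample theta_k from each arm's current Beta posterior.
    theta = rng.beta(alpha, beta)
    # 2. Play the arm with the highest sampled probability of success.
    i = int(np.argmax(theta))
    # 3. Observe a Bernoulli reward and update that arm's posterior.
    reward = rng.random() < true_p[i]
    if reward:
        alpha[i] += 1
    else:
        beta[i] += 1

print(alpha, beta)  # posterior counts after 1000 rounds
```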
The Thompson sampling algorithm is a procedure that helps you with balancing exploration and exploitation, by choosing the arms at random, according to the distributions of rewards, and updating the distributions at each step.
Answering your question, the "currently best" arm given the data you've shown is the first arm, since it had a 5/(5+7) * 100 = 41.67% success rate, while the second one had a 40% success rate. This means that in the next round it will have a greater chance of being sampled.
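To make "greater chance" concrete, you can estimate how often arm 1 would win the per-round comparison by drawing from the two posteriors directly (again assuming a uniform $\mathsf{Beta}(1, 1)$ prior on top of your counts; the exact number depends on that choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

theta_1 = rng.beta(5 + 1, 7 + 1, size=n)    # arm 1: 5 successes, 7 failures
theta_2 = rng.beta(50 + 1, 75 + 1, size=n)  # arm 2: 50 successes, 75 failures

# Fraction of rounds in which Thompson sampling would play arm 1.
print((theta_1 > theta_2).mean())
```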
On the other hand, if you just want to explore the arms and then exploit them, then maybe you could use another algorithm, e.g. explore-first (assign arms uniformly at random for $n$ rounds, then exploit the best one), or maybe epsilon-greedy, etc. If you ended up with the data you've shown, I'm not surprised that you don't trust it; neither would I. When using Thompson sampling, the algorithm would start "correcting" itself at such a stage and explore the arm with fewer trials more, but if you stop at this stage then the result is not very conclusive. If you have a limited budget, an epsilon-first strategy may indeed be a wise option, as discussed by Tran-Thanh et al. (2010), and a minimal sketch of it follows below.
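Here is what such an explore-first policy could look like; the budget, exploration length, and true success probabilities are illustrative assumptions, not values from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

true_p = np.array([0.45, 0.40])     # hypothetical true success probabilities
budget, explore_rounds = 1000, 200  # spend 20% of the budget on exploration

successes = np.zeros(2)
pulls = np.zeros(2)

for t in range(budget):
    if t < explore_rounds:
        i = t % 2  # explore: alternate arms (or pick uniformly at random)
    else:
        i = int(np.argmax(successes / pulls))  # exploit: best empirical rate
    reward = rng.random() < true_p[i]
    successes[i] += reward
    pulls[i] += 1

print(successes / pulls)  # empirical success rate per arm at the end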