Imagine we have a two-armed bandit with prior binary (success/failure) counts. How can we interpret that using a Beta distribution? Meaning: which arm is the best arm to choose based on the prior?
arm 1: 5 successes 7 fails
arm 2: 50 successes 75 fails
You should probably start by reading more on Thompson sampling, e.g. this, or this Medium post, or this paper by Russo et al.
In standard Thompson sampling for a multi-armed bandit with $K$ arms, you assume a Beta distribution for the probability of success $\theta_k$ of each $k$-th arm
$$ \theta_k \sim \mathsf{Beta}(\alpha_k, \beta_k) $$
The Thompson sampling algorithm first samples the $\theta_k$ probabilities independently for each arm, then picks the winning arm as the one with the highest sampled probability of success
$$ i = \arg\max_k \theta_k $$
Next, you play the $i$-th arm, collect the reward (or not), and use it to update the Beta distribution to obtain the posterior. In the next round, you repeat the same procedure.
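As a rough sketch of those steps (the true success probabilities and number of rounds below are made up purely to simulate rewards, and I assume a uniform $\mathsf{Beta}(1, 1)$ prior added to the counts from the question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta parameters per arm, starting from the counts in the question
# (alpha = successes + 1, beta = failures + 1 under a uniform Beta(1, 1) prior).
alpha = np.array([5 + 1, 50 + 1], dtype=float)
beta = np.array([7 + 1, 75 + 1], dtype=float)

# Hypothetical true success probabilities, used only to simulate rewards here.
true_p = np.array([0.45, 0.40])

for t in range(1000):
    # 1. Sample theta_k from each arm's current Beta posterior.
    theta = rng.beta(alpha, beta)
    # 2. Play the arm with the highest sampled probability of success.
    i = int(np.argmax(theta))
    # 3. Observe a Bernoulli reward and update that arm's posterior.
    reward = rng.random() < true_p[i]
    if reward:
        alpha[i] += 1
    else:
        beta[i] += 1

print(alpha, beta)  # posterior counts after 1000 rounds
```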
The Thompson sampling algorithm is a procedure that helps you with balancing exploration and exploitation, by choosing the arms at random, according to the distributions of rewards, and updating the distributions at each step.
Answering your question, the "currently best" arm given the data you've shown is the first arm, since it had a 5/(5+7) * 100 = 41.67% success rate, while the second one had a 40% success rate. This means that in the next round it will have a greater chance of being sampled.
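To make "greater chance" concrete, you can estimate how often arm 1 would win the per-round comparison by drawing from the two posteriors directly (again assuming a uniform $\mathsf{Beta}(1, 1)$ prior on top of your counts; the exact number depends on that choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

theta_1 = rng.beta(5 + 1, 7 + 1, size=n)    # arm 1: 5 successes, 7 failures
theta_2 = rng.beta(50 + 1, 75 + 1, size=n)  # arm 2: 50 successes, 75 failures

# Fraction of rounds in which Thompson sampling would play arm 1.
print((theta_1 > theta_2).mean())
```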
On the other hand, if you just want to explore the arms and then exploit them, then maybe you could use another algorithm, e.g. explore-first (assign arms uniformly at random for $n$ rounds, then exploit the best one), or maybe epsilon-greedy, etc. If you ended up with the data you've shown, I'm not surprised that you don't trust it; neither would I. When using Thompson sampling, the algorithm would start "correcting" itself at such a stage and explore the arm with fewer trials more, but if you stop at this stage then the result is not very conclusive. If you have a limited budget, an epsilon-first strategy may indeed be a wise option, as discussed by Tran-Thanh et al. (2010), and a minimal sketch of it follows below.
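Here is what such an explore-first policy could look like; the budget, exploration length, and true success probabilities are illustrative assumptions, not values from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

true_p = np.array([0.45, 0.40])     # hypothetical true success probabilities
budget, explore_rounds = 1000, 200  # spend 20% of the budget on exploration

successes = np.zeros(2)
pulls = np.zeros(2)

for t in range(budget):
    if t < explore_rounds:
        i = t % 2  # explore: alternate arms (or pick uniformly at random)
    else:
        i = int(np.argmax(successes / pulls))  # exploit: best empirical rate
    reward = rng.random() < true_p[i]
    successes[i] += reward
    pulls[i] += 1

print(successes / pulls)  # empirical success rate per arm at the end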