Posterior sampling for bandit

Question

I am looking at this paper on posterior sampling.

The algorithm is on page 8 (image below):

Let’s say I have 3 arms and on line 22 arm 3 is the best followed by arm 2 then arm
Line 24 calculates the number of candidate arm sample to draw. There is a sum over all the sub-optimal arms to produce N_t+1. Let’s say that turns out to be 100.
How are lines 25 and 26 calculated?
When it says “Draw N_t+1 candidate arm samples” I will draw 100 samples but from what distribution? Cat(…) takes as it’s argument the probability that arm “a” will be the optimal arm at time t+1 but how does Cat(…) lead to selecting an arm 1,2 or 3? I am confused as to what Cat(…) is doing. Can you explain with this example?

Basically what is $$Cat(p_{{\hat{a}},t+1})$$ doing in line 25 and what does it produce that is used in line 26?

Then on Line 26 it looks like there will be a sequence of arm numbers 1111…2222…33333 and the mode will be selected as the arm to play?

This is from page 8:

score 1 · Answer 1 · answered Dec 16 '17 at 01:32

My best guess is that Cat() stands for the categorical distribution, where category $k$ corresponds to the event "arm $k$ is optimal". The probability category $k$ is drawn is given explicitly in Equation 11.

In line 25, we sample a vector of $N+1$ realizations from the categorical distribution.

I think the authors have several typos in Line 26, which is where a lot of confusion comes from. In particular, $\hat{a}_{t+1}^{(n)}$ is a scalar and so computing it's mode does not make sense. Based on other posterior sampling approaches, I think line 26 ought to be: $$a_{t+1} = \text{Mode}(\hat{a}_{t+1}).$$ In this case we can think of lines 25 and 26 as an election, where votes are cast (line 25) and then $a_{t+1}$ is set to an arm with the most votes (line 26).

Example:

In your example, the relative ranking of arms 1,2,3 influences how likely we are to see 1,2, and 3 in $\hat{a}_{t+1}$. Since $\hat{p}_{3,t+1} > \hat{p}_{2,t+1} > \hat{p}_{1,t+1}$, we should see more 3s than 2s and more 2s than 1s (on average).

In code you could write line 25 as

r = rand(100)
a_hat = ones(100);
a_hat[find(r <= p_2)] = 2
a_hat[find(r > p_2 && r <= p_3)] = 3

which generates a vector of 100 numbers such that the probability that an entry is 1, 2, or 3 is $p_1$, $p_2$ or $p_3$, respectively.

Now in line 26 we tally the "votes" to find which arm to pull. An example in code (though you should just use the mode function in practice):

votes = zeros(3)
for n=1:100
   votes[a_hat[n]] += 1
end
action = indmax(votes)

Posterior sampling for bandit

1 Answers1

Example: