I am new to this area of machine learning. I am walking myself through the UCB1 algorithm, which seems to assume that the payoff can be learnt only for the action that is taken. What I am curious about is whether there is existing analysis of the multi-armed bandit problem for the case where the payoff can also be learnt for the actions that are not taken (maybe the term for this is "full reward vector"?), and that information is then used in subsequent steps.
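For reference, here is a minimal sketch of UCB1 as I understand it (assuming rewards in [0, 1]; `pull(arm)` is a hypothetical function standing in for the environment, and it returns the reward of the chosen arm only):

```python
import math
import random

def ucb1(n_arms, pull, horizon):
    """Standard UCB1: only the reward of the arm actually pulled is observed."""
    counts = [0] * n_arms           # how many times each arm has been pulled
    means = [0.0] * n_arms          # empirical mean reward of each arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1             # initialization: pull every arm once
        else:
            # pick the arm with the largest empirical mean + exploration bonus
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        reward = pull(arm)          # reward is observed ONLY for the chosen arm
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]

    return means

# toy Bernoulli environment, just to make the sketch runnable
probs = [0.2, 0.5, 0.7]
print(ucb1(len(probs), lambda a: 1.0 if random.random() < probs[a] else 0.0, 10000))
```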
I could modify the UCB1 algorithm to update the payoffs of actions not taken as well, but I would like to know if there are any drawbacks or pitfalls to doing this. I have not done a closed-form analysis of the regret bounds when the reward vector is fully available.
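Concretely, the modification I have in mind is just to change the update step so that every arm's count and mean are refreshed each round. A sketch under the same assumptions as above, with a hypothetical `pull_all(arm)` that returns the reward of every arm rather than just the pulled one:

```python
import math

def ucb1_full_feedback(n_arms, pull_all, horizon):
    """Modified variant: after each round the rewards of ALL arms are revealed
    (the "full reward vector"), and every arm's statistics are updated."""
    counts = [0] * n_arms
    means = [0.0] * n_arms

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        rewards = pull_all(arm)     # hypothetical: returns one reward per arm
        for i in range(n_arms):     # update every arm, not only the chosen one
            counts[i] += 1
            means[i] += (rewards[i] - means[i]) / counts[i]

    return means
```

One thing I notice in this sketch is that the counts stay equal across arms, so the exploration bonus no longer distinguishes them and the rule effectively reduces to picking the best empirical mean; that is partly why I wonder whether there are pitfalls.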
I don't know what the terminology is for such cases, so I am having a hard time finding work on them. I understand that when the full reward vector is available, the problem becomes a supervised reinforcement learning problem. What I am curious about is whether multi-armed bandit methods can be used effectively in such a supervised setting, and whether it even makes sense to do so. I would very much appreciate pointers to work/papers in this area.