I am new to this area of machine learning. I am walking myself through the UCB1 algorithm, which seems to assume that the payoff can be learnt only for the action that is taken. What I am curious about is whether there is existing analysis of the multi-armed bandit problem for the case where the payoff can also be learnt for the actions that are not taken (maybe the term for this is "full reward vector"?), and that information is then used in subsequent steps.
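For reference, here is a minimal sketch of UCB1 as I understand it (assuming rewards in [0, 1]; `pull(arm)` is a hypothetical function standing in for the environment, and it returns the reward of the chosen arm only):

```python
import math
import random

def ucb1(n_arms, pull, horizon):
    """Standard UCB1: only the reward of the arm actually pulled is observed."""
    counts = [0] * n_arms           # how many times each arm has been pulled
    means = [0.0] * n_arms          # empirical mean reward of each arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1             # initialization: pull every arm once
        else:
            # pick the arm with the largest empirical mean + exploration bonus
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        reward = pull(arm)          # reward is observed ONLY for the chosen arm
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]

    return means

# toy Bernoulli environment, just to make the sketch runnable
probs = [0.2, 0.5, 0.7]
print(ucb1(len(probs), lambda a: 1.0 if random.random() < probs[a] else 0.0, 10000))
```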
I could modify the UCB1 algorithm to update the payoffs of actions not taken as well, but I would like to know if there are any drawbacks or pitfalls to doing this. I have not done a closed-form analysis of the regret bounds when the reward vector is fully available.
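Concretely, the modification I have in mind is just to change the update step so that every arm's count and mean are refreshed each round. A sketch under the same assumptions as above, with a hypothetical `pull_all(arm)` that returns the reward of every arm rather than just the pulled one:

```python
import math

def ucb1_full_feedback(n_arms, pull_all, horizon):
    """Modified variant: after each round the rewards of ALL arms are revealed
    (the "full reward vector"), and every arm's statistics are updated."""
    counts = [0] * n_arms
    means = [0.0] * n_arms

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        rewards = pull_all(arm)     # hypothetical: returns one reward per arm
        for i in range(n_arms):     # update every arm, not only the chosen one
            counts[i] += 1
            means[i] += (rewards[i] - means[i]) / counts[i]

    return means
```

One thing I notice in this sketch is that the counts stay equal across arms, so the exploration bonus no longer distinguishes them and the rule effectively reduces to picking the best empirical mean; that is partly why I wonder whether there are pitfalls.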
I don't know what the terminology is for such cases, so I am having a hard time finding work on them. I understand that when the full reward vector is available, the problem becomes a supervised reinforcement learning problem. What I am curious about is whether multi-armed bandit methods can be used effectively in such a supervised setting, and whether it even makes sense to do so. I would very much appreciate pointers to work/papers in this area.