I am wondering if there is an interpretation of the Bandit Problem with more than one state. I know there are versions that view each slot machine as an independent Markovian machine, so that an arm's state evolves when that arm is pulled.
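To be explicit about the version I mean (this is my own notation, so please correct me if it is off): each arm $i$ carries its own state $s_i$, and only the pulled arm's state moves,

$$
\text{pull arm } i:\qquad r_t = r_i\!\left(s_i^{t}\right), \qquad s_i^{t+1} \sim P_i\!\left(\cdot \mid s_i^{t}\right), \qquad s_j^{t+1} = s_j^{t} \;\text{ for } j \neq i .
$$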
However, I cannot seem to find any discussion of incorporating a state that is based, more or less, on the player's psychological/belief state. What I mean is that there should be some distinction between the scenario where I have won \$5000 after ten trials and the scenario where I have lost \$5000 after ten trials. The way I see it, whether I have won or lost a bunch of money would certainly affect how I make decisions.
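To make concrete what I mean, here is a toy sketch (the payout distributions and the \$5000 threshold are made up purely for illustration): the "state" the policy reacts to is the player's cumulative winnings, not just per-arm reward estimates.

```python
import random

# Toy illustration only: the payout distributions and the -5000 threshold
# are invented to show a policy that depends on the player's wealth "state".
ARM_PAYOUTS = [
    lambda: random.gauss(0, 100),   # arm 0: zero-mean, high variance
    lambda: random.gauss(-1, 10),   # arm 1: slightly losing, low variance
]

def risk_sensitive_policy(cumulative_winnings):
    """Pick an arm based on how much the player is up or down so far."""
    if cumulative_winnings < -5000:
        return 0   # deep in the red: gamble on the high-variance arm
    return 1       # otherwise play it safe

def run(n_trials=10):
    wealth = 0.0
    for _ in range(n_trials):
        arm = risk_sensitive_policy(wealth)
        wealth += ARM_PAYOUTS[arm]()   # pull the arm, update the wealth state
    return wealth

if __name__ == "__main__":
    print(run())
```

In the standard formulation the decision rule would depend only on the observed rewards of each arm; here it also depends on the running total, which is the kind of variation I am asking about.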
The apparent lack of such variations of the Bandit Problem seems to imply that they are not particularly useful or practical, so I would very much appreciate it if someone could shed some light on why.