In reinforcement learning/multi-armed bandits, why do we look at expected reward and not the most likely reward?

Question

This is a dilemma that I have faced in applied probability in general. Say you have the choice to put your savings of $\$10$ in a deposit account with a guaranteed return of $\$100$, or to buy a lottery ticket with that money, where you can win $\$100,000$ with a probability of 0.01 and lose everything with a probability of 0.99. Some people might say that the expected winnings are $\$100$ in the first case and $\$1000$ in the second case, so we should buy the lottery ticket. Others might say that, since we are highly likely to lose everything in the second case, we should choose the first option.

In this context, what is the right choice for RL and multi-armed bandit problems?

Does this answer your question? [Why care so much about expected utility?](https://stats.stackexchange.com/questions/313290/why-care-so-much-about-expected-utility) - (although a viable objection is that $ or "score in a game" is often not a good proxy for utility) — shimao, Dec 03 '21 at 01:49

score 1 · Answer 1 · answered Dec 02 '21 at 12:14

These are averages, so you should interpret them as such. In the limit of many repetitions, the second option is better because it yields a higher average return. However, if you were to run this experiment only once, then the first option is probably better/safer. Since in RL (as in many other tasks) we are interested in long-run results, we use average rewards.
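To make the long-run argument concrete, here is a minimal simulation sketch. It treats the two options from the question as two bandit arms, using the payoffs and probabilities stated there. The function names (`deposit`, `lottery`, `average_reward`, `most_likely_reward`) are hypothetical names chosen for illustration, not part of any library.

```python
import random
from collections import Counter


def deposit(rng):
    """Arm 1: guaranteed payoff of 100 (value taken from the question)."""
    return 100.0


def lottery(rng):
    """Arm 2: pays 100,000 with probability 0.01, otherwise 0."""
    return 100_000.0 if rng.random() < 0.01 else 0.0


def average_reward(arm, n_pulls, seed=0):
    """Empirical mean reward over n_pulls independent pulls of one arm."""
    rng = random.Random(seed)
    return sum(arm(rng) for _ in range(n_pulls)) / n_pulls


def most_likely_reward(arm, n_pulls, seed=0):
    """Empirical mode: the reward value observed most often."""
    rng = random.Random(seed)
    counts = Counter(arm(rng) for _ in range(n_pulls))
    return counts.most_common(1)[0][0]


# Compare short and long horizons for both arms.
for n in (1, 100, 100_000):
    print(n, average_reward(deposit, n), average_reward(lottery, n))

# The most likely single outcome of each arm.
print(most_likely_reward(deposit, 1_000), most_likely_reward(lottery, 1_000))
```

With few pulls, the lottery's average is very noisy and is often exactly 0, which is also its most likely single outcome. With many pulls, its average should settle near the expected value of 1000, above the deposit's 100. That is the sense in which the expectation, rather than the mode, is the relevant quantity when an arm is pulled many times.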
