If the optimal policy is known to be stochastic (e.g. in the stone, paper, scissors game), can this stochastic policy be found using SARSA or Q-learning, or is it only possible with policy gradient approaches?
1 Answer
Neither Q-learning nor SARSA strictly defines how the policy should be derived from the action-values, so you might be able to use them to learn some approximation to a stochastic policy, for instance by using a variation of Gibbs (or Boltzmann) sampling over the learned values.
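For concreteness, Boltzmann action selection over learned action-values looks something like this (a minimal sketch in Python; the function names and the single `temperature` parameter are just for illustration):

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    """Gibbs/Boltzmann (softmax) distribution over actions from action-values.

    Low temperatures approach greedy selection; high temperatures approach
    a uniform random policy.
    """
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()          # shift for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def sample_action(q_values, temperature=1.0, rng=None):
    """Sample an action index from the Boltzmann distribution."""
    if rng is None:
        rng = np.random.default_rng()
    probs = boltzmann_policy(q_values, temperature)
    return int(rng.choice(len(probs), p=probs))
```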
However, I expect it would be an uphill battle. Q-learning especially is built around the idea of bootstrapping from a maximum value, so the assumption of a deterministic policy is baked into the learning algorithm. At least with SARSA (or Expected SARSA) you can feed in a real policy choice and optimise that. However, even there the assumption is built in that you should increase the probability of taking the action that currently maximises return. In a scissors-paper-stone scenario this may actually reduce the expected long-term return, so the learning algorithm may oscillate. It is also likely to be sensitive to, and limited by, the hyper-parameters that control action selection, which you are not able to learn or change within the algorithm (compare this with policy gradient methods, where the parameters controlling action selection are modified directly by the optimisation).
The problem is that in value-based methods you get no indication of how changing the policy will affect the values; it is simply implicit that selecting the highest-valued action is always good. Note that this is strictly true in fully-specified MDPs, which can be proven to always have at least one deterministic optimal policy, so it is not necessarily a bad thing.
Adversarial games like stone, paper, scissors are not strict MDPs unless you "freeze" the behaviour of an opponent (this opponent can still behave stochastically and react to your previous plays, but once their policy is fixed, it is possible to optimise a deterministic policy to play against it). It is when you allow both agents to react to each other's play and update their policies that you need a stochastic policy, and that is where value-based learning algorithms will start to let you down.
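To illustrate the point about freezing the opponent: against any fixed opponent distribution, some single action maximises expected reward, so a deterministic policy is enough. A small sketch (the opponent's mix here is made up purely for illustration):

```python
import numpy as np

# Zero-sum payoff matrix for stone, paper, scissors from our perspective:
# rows = our action, columns = opponent action (0 = stone, 1 = paper, 2 = scissors).
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

# A hypothetical frozen opponent policy, slightly biased towards stone.
opponent = np.array([0.5, 0.3, 0.2])

expected = payoff @ opponent      # expected reward of each of our actions
print(expected)                   # [-0.1  0.3 -0.2]
print(int(np.argmax(expected)))   # 1 -> always playing paper is optimal
```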
I think you could probably get SARSA with Gibbs action sampling and a low learning rate to learn a correct behaviour for stone, paper, scissors. It would struggle, though, to learn policies where the rewards are not even, e.g. a modified stone, paper, scissors where you score 1 for winning as stone, 2 for winning as paper and 3 for winning as scissors. That game has an analytical solution, but you won't get SARSA to approximate it without very careful tuning of the temperature hyper-parameter.
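For comparison, here is one way to compute the analytical solution of that modified game, under the assumption that it is zero-sum (the loser loses whatever the winner scores): at a fully mixed equilibrium, every action must have the same expected payoff against the equilibrium mix, which gives a small linear system.

```python
import numpy as np

# Zero-sum payoff matrix for the modified game, from our perspective
# (0 = stone, 1 = paper, 2 = scissors). Winning as stone scores 1,
# as paper 2, as scissors 3; the loser loses the same amount (assumption).
A = np.array([[ 0, -2,  1],
              [ 2,  0, -3],
              [-1,  3,  0]], dtype=float)

# At a fully mixed equilibrium q, every action has the same expected
# payoff v against q, and the probabilities sum to 1:
#   A q = v * 1,  sum(q) = 1
M = np.block([[A, -np.ones((3, 1))],
              [np.ones((1, 3)), np.zeros((1, 1))]])
b = np.array([0.0, 0.0, 0.0, 1.0])
q_stone, q_paper, q_scissors, v = np.linalg.solve(M, b)

print(q_stone, q_paper, q_scissors, v)   # 0.5, 0.1667, 0.3333, 0.0
```

Under that zero-sum assumption the equilibrium mix is stone 1/2, paper 1/6, scissors 1/3 with game value 0, which is exactly the kind of uneven stochastic policy a Boltzmann-SARSA agent would have to reproduce through its temperature setting.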
