
I have a rather trivial question about SARSA and Q-learning. Looking at the pseudocode of the two algorithms in the Sutton & Barto book, I see the policy improvement step is missing.

How do I get the optimal policy from the two algorithms? Are they used only to find the optimal action values? In that case, at the end of training, should we iterate over all states to find the optimal policy using the policy improvement theorem?

Jor_El

1 Answer


In SARSA and Q-learning, the current policy is derived from the current action values. It is always the $\epsilon$-greedy or greedy action choice according to $\text{argmax}_a Q(s,a)$, so there is no need for an explicit policy improvement step.
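To make this concrete, here is a minimal sketch (my own illustration, not code from the book) of how the policy falls out of the action-value table in the tabular case. The function names `epsilon_greedy` and `greedy_policy` and the table shapes are assumptions for the example:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1, rng=np.random.default_rng()):
    """Behaviour policy used during training: explore with probability epsilon,
    otherwise take the greedy action argmax_a Q(s, a)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # exploratory random action
    return int(np.argmax(Q[state]))           # greedy action w.r.t. current Q

def greedy_policy(Q):
    """The 'improved' policy is implicit: it is just the argmax over the
    learned action values, taken per state."""
    return {s: int(np.argmax(q)) for s, q in enumerate(Q)}

# Hypothetical table with 5 states and 3 actions
Q = np.zeros((5, 3))
action = epsilon_greedy(Q, state=0, n_actions=3)
policy = greedy_policy(Q)   # final policy extraction is a single argmax per state
```

So the only "sweep over states" you need at the end of training is this per-state argmax, which is not a separate policy improvement step in the dynamic-programming sense.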

This is closer to Value Iteration than to Policy Iteration, but it still follows the concept of generalised policy iteration.

Neil Slater