
I am referring to pages 130-131 of the Sutton and Barto book on Reinforcement Learning, available here: book

I don't understand the slight difference between the two procedural algorithms described respectively on page 130 for Sarsa and on page 131 for Q-learning.

Indeed, in the first case the $\varepsilon$-greedy choice of action $A$ is inside the loop for each episode but before the loop for each step of the episode, while in the second case it is inside the loop for each step of the episode. Does this imply any real difference between the two algorithms (apart from the update rule for $Q(s,a)$, of course), or is it only a formal one?

To be more precise: can I move the $\varepsilon$-greedy choice of action $A$ inside the loop for each step of the episode in the Sarsa algorithm as well?
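For reference, the two loop structures I mean look roughly like this. This is my own Python transcription of the book's pseudocode, not the book's exact text; the `epsilon_greedy` helper, the `ToyEnv` environment, and the tabular `Q` dictionaries are placeholder assumptions added only to make the sketch runnable:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon, num_episodes = 0.1, 1.0, 0.1, 500
actions = [0, 1]                          # placeholder action set
Q_sarsa = defaultdict(float)              # tabular action values, keyed by (state, action)
Q_qlearning = defaultdict(float)

def epsilon_greedy(Q, s, eps):
    """With probability eps pick a random action, otherwise a greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

class ToyEnv:
    """Tiny stand-in environment: states 0..5, action 1 moves right, 0 moves left."""
    def reset(self):
        self.s = 3
        return self.s
    def step(self, a):
        self.s += 1 if a == 1 else -1
        done = self.s in (0, 5)
        reward = 1.0 if self.s == 5 else 0.0
        return self.s, reward, done

env = ToyEnv()

# Sarsa (p. 130): A is chosen once per episode, before the step loop;
# the A' chosen inside the loop becomes A at the end of each step.
for _ in range(num_episodes):
    S = env.reset()
    A = epsilon_greedy(Q_sarsa, S, epsilon)           # before the step loop
    done = False
    while not done:
        S2, R, done = env.step(A)
        A2 = epsilon_greedy(Q_sarsa, S2, epsilon)     # inside the step loop
        target = 0.0 if done else Q_sarsa[(S2, A2)]   # Q(terminal, .) = 0
        Q_sarsa[(S, A)] += alpha * (R + gamma * target - Q_sarsa[(S, A)])
        S, A = S2, A2

# Q-learning (p. 131): A is chosen inside the step loop.
for _ in range(num_episodes):
    S = env.reset()
    done = False
    while not done:
        A = epsilon_greedy(Q_qlearning, S, epsilon)   # inside the step loop
        S2, R, done = env.step(A)
        target = 0.0 if done else max(Q_qlearning[(S2, a)] for a in actions)
        Q_qlearning[(S, A)] += alpha * (R + gamma * target - Q_qlearning[(S, A)])
        S = S2
```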

hardhu

1 Answer


No, we cannot. Otherwise, the already-determined next action $A'$ would be thrown away.

In SARSA, the next action $A'$ is selected in the middle of the current step of the loop, and it replaces $A$ in the next step (more precisely, at the end of the current step). In other words, at the beginning of the next step, $A$ must be the already-selected action $A'$ from the current step; we cannot throw $A'$ away by making a fresh $\varepsilon$-greedy choice of $A$. In Q-learning, on the other hand, only one action is involved in each step of the loop, so if that single action $A$ is selected at the beginning of each step, nothing is thrown away.
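To make the carry-over explicit, here is a minimal SARSA episode written as a function. It reuses the placeholder `epsilon_greedy`, environment interface, and tabular `Q` assumed in the sketch in the question; it is an illustrative sketch, not the book's exact code:

```python
def sarsa_episode(env, Q, alpha, gamma, epsilon):
    """Run one SARSA episode, carrying the selected action over between steps."""
    S = env.reset()
    A = epsilon_greedy(Q, S, epsilon)         # initial choice, outside the step loop
    done = False
    while not done:
        # A is NOT reselected here: it is the A' already chosen at the end of
        # the previous step, i.e. the action we are about to execute.
        S2, R, done = env.step(A)
        A2 = epsilon_greedy(Q, S2, epsilon)   # choose A' for the next step
        target = 0.0 if done else Q[(S2, A2)]      # Q(terminal, .) = 0
        Q[(S, A)] += alpha * (R + gamma * target - Q[(S, A)])
        S, A = S2, A2                         # A' becomes the next step's A
    # Inserting `A = epsilon_greedy(Q, S, epsilon)` at the top of the loop would
    # overwrite this carried-over A', which is exactly the "throwing away" that
    # the SARSA update must avoid.
```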

Esmailian
  • Thanks for your answer. How would you rewrite the pseudocode for the SARSA algorithm in the case of a continuing task, where there is no initial state and action for the episode? – hardhu Mar 19 '19 at 11:00