
I would like to ask if someone can refer me to a paper containing a proof of convergence of $Q$-learning/SARSA (either/both), two of the standard learning algorithms in reinforcement learning.

The iterative algorithm for SARSA is as follows:

$$ Q(s_t, a_t) \leftarrow Q(s_t,a_t) + \alpha[r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)],$$ where $r_t$ is the reward, $\gamma$ is the discount factor, $\alpha$ is the learning rate, $s_t$ is the state, $a_t$ is the action, and $t$ is the time step.
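
To make the notation concrete, here is how I read the update as Python (the table `Q` and the names `alpha` and `gamma` are just my own shorthand for the symbols above, not any particular library):

```python
# One SARSA backup on a tabular action-value estimate.
# Q is a dict mapping (state, action) pairs to floats.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    td_target = r + gamma * Q[(s_next, a_next)]   # r_t + gamma * Q(s_{t+1}, a_{t+1})
    td_error = td_target - Q[(s, a)]              # the bracketed term
    Q[(s, a)] += alpha * td_error                 # step the estimate toward the target
```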

I have seen this update used many times, but I have never understood why it works.

Can someone explain, or give some insight into, why this algorithm converges and why $Q$-learning/SARSA is possible?

Thanks

cgo
  • See [Watkins & Dayan (1992)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.7501&rep=rep1&type=pdf) for a detailed proof of convergence of Q-learning, or Watkins' PhD thesis [Learning from Delayed Rewards](https://www.researchgate.net/profile/Christopher_Watkins2/publication/33784417_Learning_From_Delayed_Rewards/links/53fe12e10cf21edafd142e03.pdf) (page 220). The second one was more understandable for me, and I can recommend reading it. – Jan Vainer Aug 27 '17 at 16:01
  • It's all in the [Sutton & Barto](http://incompleteideas.net/sutton/book/the-book-2nd.html) textbook. – Don Reba Jul 19 '17 at 13:38
  • There is a proof for Q-learning in Proposition 5.5 of the book *Neuro-Dynamic Programming* by Bertsekas and Tsitsiklis. Sutton and Barto refer to Singh, Jaakkola, Littman, and Szepesvari (2000) for the proof of SARSA. – user220743 Sep 16 '18 at 06:11

1 Answer


The SARSA algorithm is a stochastic approximation to the Bellman equations for Markov Decision Processes. One way of writing the Bellman equation for $q_{\pi}(s,a)$ is:

$$q_{\pi}(s,a) = \sum_{s',r}p(s',r|s,a)( r + \gamma \sum_{a'}\pi(a'|s') q_{\pi}(s',a'))$$

Where $p(s',r|s,a)$ is the probability of transition to state $s'$ with reward $r$ given current state $s$ and action $a$.

In Dynamic Programming solutions to MDPs, the Bellman equation is used directly as an update mechanism that converges to correct values for $q_{\pi}(s,a)$. For estimating the value of a fixed policy, this works because the only stationary points of these updates are when the two sides are equal, and there is one equation for each $q_{\pi}(s,a)$ relating it linearly to one or more other $q_{\pi}(s',a')$, so it has a computable solution.
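
As a rough sketch of that DP iteration (the layout of the model `p[s][a]` as a list of `(prob, next_state, reward)` triples and the policy table `pi[s][a]` are assumptions I am making for illustration, not a standard API):

```python
def evaluate_q(p, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: repeatedly apply the Bellman equation
    for q_pi as an assignment until both sides agree to within tol.
    Every successor state is assumed to be a key of p (terminal states
    map to an empty dict)."""
    q = {(s, a): 0.0 for s in p for a in p[s]}    # arbitrary initial values
    while True:
        delta = 0.0
        for s in p:
            for a in p[s]:
                # right-hand side of the Bellman equation for q_pi(s, a)
                new = sum(prob * (r + gamma * sum(pi[s2][a2] * q[(s2, a2)]
                                                  for a2 in p[s2]))
                          for prob, s2, r in p[s][a])
                delta = max(delta, abs(new - q[(s, a)]))
                q[(s, a)] = new
        if delta < tol:
            return q
```

With $\gamma < 1$ each sweep is a contraction, so the iteration settles exactly where the two sides of the equation are equal, which is the stationary point mentioned above.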

When you add the search for an optimal solution, such as in SARSA, then the policy $\pi$ changes. There is a proof that changing the policy to pick the action $\pi'(s) = \operatorname{argmax}_a q_{\pi}(s,a)$ will always either improve the policy or leave it unchanged if it is already optimal. This is called the Policy Improvement Theorem, and it is based on the inequality $v_{\pi}(s) \le q_{\pi}(s,\pi'(s))$; there is an extension to the theorem that covers the $\epsilon$-greedy policies used in learners such as Monte Carlo Control or SARSA.
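
For reference, the $\epsilon$-greedy choice that the extended theorem covers looks like this (a sketch only; the names are mine):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Explore uniformly with probability epsilon, otherwise act greedily
    with respect to the current estimates (Q is assumed to hold an entry
    for every (s, a) pair)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```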

TD learning, including SARSA and Q-Learning, uses the ideas of Dynamic Programming in a sample-based environment where the equalities are true in expectation. But essentially you can see how the update $q_{\pi}(s,a) = \sum_{s',r}p(s',r|s,a)( r + \gamma \sum_{a'}\pi(a'|s') q_{\pi}(s',a'))$ has turned into SARSA's update:

  • The weighted sum over state transition and reward probabilities happens in expectation as you take many samples. So $Q(S,A) = \mathbb{E}[ \text{Sampled}(R) + \gamma \sum_{a'}\pi(a'|S') q_{\pi}(S',a')]$ (technically you have to sample R and S' together)

  • Likewise the weighting of the current policy happens in expectation. So $Q(S,A) = \mathbb{E}[ \text{Sampled}(R + \gamma Q(S',A'))]$

  • To change this expectation into an incremental update, allowing for non-stationarity as the policy improves over time, we add a learning rate and move each estimate towards the latest sampled value: $Q(S,A) = Q(S,A) +\alpha[R + \gamma Q(S',A') - Q(S,A)]$
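
Putting those three points together, a minimal tabular SARSA loop looks like the sketch below. The environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) and all the names are assumptions for illustration, not a fixed API:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                        # Q[(s, a)] defaults to 0.0

    def policy(s):                                # epsilon-greedy w.r.t. the current Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = policy(s2) if not done else None
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # the incremental update above
            s, a = s2, a2
    return Q
```

Because `Q` changes as learning proceeds, the $\epsilon$-greedy policy derived from it improves over the run; that is the non-stationarity which the learning rate $\alpha$ accommodates.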

For a more thorough explanation of the building blocks of algorithms like SARSA and Q-Learning, you can read *Reinforcement Learning: An Introduction* by Sutton and Barto. For a more concise and mathematically rigorous treatment, you can read *Algorithms for Reinforcement Learning* by Szepesvári.

Neil Slater