4

I am currently trying to find a formal proof of convergence for the Monte Carlo Reinforcement Learning Methods described in Sutton,Barto's Book "Reinforcement Learning - An Introduction" , Section 5.

They explain that along the ideas of generalized policy iteration, one obtains a uniformly better (or equally good) policy in each step of the iteration. However, all of this is under the assumption that the exact value functions are known.

I have been looking for a formal proof of convergence to the optimal policy for the First Visit Monte Carlo Algorithm, but I have not been able to find anything. Furthermore, what kind of makes the convergence difficult is that the returns $G_t$ are averaged over all returns observed, but the policy changes during the algorithm, so one does never really obtain an unbiased estimate of $\mathbb{E}_\pi[G_t | S_t = s, A_t = a]$.

If someone could link some sources or explain how one could proof convergence, I would be very greatful.

GreenLogic
  • 193
  • 6
  • The book says this is an open problem and not proved yet . . . although there may be partial proofs under some stricter assumptions – Neil Slater Jun 11 '19 at 13:56
  • @NeilSlater Interesting. I must have overread that part! What page did you find that on? – GreenLogic Jun 11 '19 at 13:59
  • I'm trying to find it. Perhaps I am mis-remembering, as I have found a couple of similar statements (about proof of convergence for Q learning (which has been done), about off-policy MC Control and about TD vs MC) – Neil Slater Jun 11 '19 at 14:13
  • 2
    Found it. Page 99 in final revision of second edition, regarding MC Control with exploring starts "Convergence to this optimal fixed point seems inevitable as the changes to the action-value function decrease over time, but has not yet been formally proved. In our opinion, this is one of the most fundamental open theoretical questions in reinforcement learning (for a partial solution, see Tsitsiklis, 2002)." Not sure if this *only* applied to ES, but I expect not, as ES is a simplification which makes MC control easier . . . – Neil Slater Jun 11 '19 at 14:18

0 Answers0