I am currently trying to find a formal proof of convergence for the Monte Carlo reinforcement learning methods described in Sutton and Barto's book "Reinforcement Learning: An Introduction", Chapter 5.
They explain that, following the idea of generalized policy iteration, one obtains a uniformly better (or equally good) policy in each step of the iteration. However, all of this is under the assumption that the exact value functions are known.
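For reference, the improvement step I have in mind is the greedy update $\pi'(s) = \arg\max_a q_\pi(s, a)$, for which the policy improvement theorem guarantees (assuming the exact action values $q_\pi$ are available):

$$q_\pi(s, \pi'(s)) \ge v_\pi(s) \;\;\forall s \quad\Longrightarrow\quad v_{\pi'}(s) \ge v_\pi(s) \;\;\forall s.$$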
I have been looking for a formal proof of convergence to the optimal policy for the first-visit Monte Carlo control algorithm, but I have not been able to find one. What makes the convergence analysis difficult is that the returns $G_t$ are averaged over all returns observed so far, while the policy keeps changing during the algorithm, so one never really obtains an unbiased estimate of $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$.
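To make precise which algorithm I mean, here is a minimal sketch of first-visit Monte Carlo control with exploring starts. The tiny chain environment (`step`, `TERMINAL`, etc.) is just a placeholder of my own, not from the book; the point is that `returns[(s, a)]` keeps accumulating returns generated under all past policies, not only the current one, which is exactly the bias I am worried about:

```python
import random
from collections import defaultdict

# Placeholder environment: a small chain 0-1-2-3, action 1 moves right,
# action 0 moves left, state 3 is terminal, reward 1 on reaching it.
N_STATES, N_ACTIONS, TERMINAL, GAMMA = 4, 2, 3, 1.0

def step(s, a):
    s_next = min(s + 1, TERMINAL) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == TERMINAL else 0.0)

def generate_episode(policy, s, a, max_len=50):
    """Exploring start: the first state-action pair (s, a) is given."""
    episode = []
    for _ in range(max_len):                 # cap to keep every episode finite
        if s == TERMINAL:
            break
        s_next, r = step(s, a)
        episode.append((s, a, r))
        s, a = s_next, policy[s_next]
    return episode

q = defaultdict(float)
returns = defaultdict(list)                  # all returns ever observed per (s, a)
policy = {s: random.randrange(N_ACTIONS) for s in range(N_STATES)}

for _ in range(5000):
    s0 = random.randrange(TERMINAL)          # exploring starts: random (s, a)
    a0 = random.randrange(N_ACTIONS)
    episode = generate_episode(policy, s0, a0)

    g = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        g = GAMMA * g + r
        if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:   # first visit only
            returns[(s, a)].append(g)        # mixes returns from OLD policies
            q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            policy[s] = max(range(N_ACTIONS), key=lambda b: q[(s, b)])

print({s: policy[s] for s in range(TERMINAL)})   # empirically settles on "go right"
```

So the estimate $Q(s, a)$ is an average of returns sampled under a whole sequence of different policies, and it is this mixture that I do not know how to handle in a convergence argument.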
If someone could link some sources or explain how one could prove convergence, I would be very grateful.