I was reading Sutton and Barto's book Reinforcement Learning: An Introduction, in particular the part on policy iteration.
It gives a proof of convergence of policy iteration for deterministic policies.
Out of curiosity, I tried to find a proof for the case of stochastic policies,
but I couldn't find any clear explanation that deals with it.
Could someone provide a clear proof of convergence of policy iteration with stochastic policies?