
I am having trouble understanding this algorithm, since this is not how I would have written it.

To me, we should first fix a policy. Then, we evaluate the Q values associated with this policy by exploring and reducing the probability of exploration after each episode. Once we have a good estimate of Q for this policy, we can improve the policy by taking the argmax of Q (the policy improvement theorem guarantees this improves the policy). And we repeat this again and again, as sketched below.
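For concreteness, here is a minimal sketch of the scheme described above (Monte Carlo policy iteration with a full evaluation phase before each greedy improvement step). The environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) and all episode/decay parameters are assumptions made for illustration, not part of the original question.

```python
import random
from collections import defaultdict

def evaluate_policy(env, policy, n_actions, gamma=1.0,
                    episodes=5000, eps0=1.0, eps_decay=0.999):
    """Estimate Q for a *fixed* policy by averaging Monte Carlo returns,
    exploring with a probability that shrinks after each episode."""
    Q = defaultdict(float)
    N = defaultdict(int)
    eps = eps0
    for _ in range(episodes):
        # Generate one episode: follow the fixed policy, explore with prob. eps
        state, done, trajectory = env.reset(), False, []
        while not done:
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                action = policy[state]
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        # Every-visit Monte Carlo update of Q towards the observed returns
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
        eps *= eps_decay  # reduce exploration after each episode
    return Q

def mc_policy_iteration(env, n_states, n_actions, sweeps=10):
    """Alternate a full evaluation of the current policy with a greedy improvement step."""
    policy = {s: 0 for s in range(n_states)}
    for _ in range(sweeps):
        Q = evaluate_policy(env, policy, n_actions)
        # Policy improvement: act greedily with respect to the estimated Q
        for s in range(n_states):
            policy[s] = max(range(n_actions), key=lambda a: Q[(s, a)])
    return policy
```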

Here, instead, we update Q with a new episode sampled under the current policy. However, the Q we are updating was computed using previous episodes sampled from previous policies. How can you have a good estimate of Q for a specific policy when the computation mixes in other policies and uses only one episode from the current policy? Moreover, the policy improvement theorem states that we actually improve the policy only if Q is computed exactly. Here, that is clearly not the case, which motivates me to follow the first approach I explained.

Can you tell me whether my approach has a name, and why Monte Carlo Control would be more efficient?

The algorithm
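The original image of the algorithm is not reproduced here. Assuming it is the standard on-policy Monte Carlo control with an ε-greedy policy (as in Sutton & Barto), a minimal sketch of what the question is describing could look like the following; the environment interface and the hyperparameters are again assumptions. Note how Q is a running average over all episodes generated so far (i.e. over every past version of the policy), and the policy is made greedy with respect to that running estimate immediately after each episode.

```python
import random
from collections import defaultdict

def mc_control(env, n_actions, gamma=1.0, episodes=50000,
               eps0=1.0, eps_decay=0.9999, eps_min=0.05):
    """On-policy every-visit Monte Carlo control with an eps-greedy policy.
    Evaluation and improvement are interleaved after every single episode."""
    Q = defaultdict(float)
    N = defaultdict(int)
    eps = eps0

    def act(state):
        # eps-greedy with respect to the current running estimate of Q
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        # 1. Generate one episode with the current (eps-greedy) policy
        state, done, trajectory = env.reset(), False, []
        while not done:
            action = act(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        # 2. Update Q from this single episode (incremental average of returns,
        #    which mixes in returns collected under all previous policies)
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
        # 3. The policy improves implicitly: act() is already greedy w.r.t. the new Q
        eps = max(eps_min, eps * eps_decay)

    return Q
```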
