
In Sutton and Barto's RL book, the authors prove that any $\epsilon$-greedy policy with respect to $q_{\pi}$ is an improvement over any $\epsilon$-soft policy $\pi$; this is assured by the policy improvement theorem. Let $\pi'$ be the $\epsilon$-greedy policy. In this derivation, I couldn't understand how the authors went from Equation 1 to Equation 2.

Equation 1: $q_{\pi}(s,\pi'(s)) = \sum_{a}\pi'(a|s)\,q_{\pi}(s,a)$

Equation 2: $q_{\pi}(s,\pi'(s)) = \frac{\epsilon}{|A(s)|}\sum_{a} q_{\pi}(s,a) + (1 - \epsilon)\max_{a} q_{\pi}(s,a)$

As far as I understand, we choose a non-greedy action with probability $\epsilon$ and the greedy action with probability $1 - \epsilon$. But then how did we end up with $\frac{\epsilon}{|A(s)|}$ as the weight for non-greedy actions? Shouldn't it be $\frac{\epsilon}{\text{number of non-greedy actions}}$, so that the weights sum to $1$, since they are probabilities after all?

Am I missing something here? Please help me out; I am a beginner in RL. Thanks.

adithya

1 Answer


By a "non-greedy action", they do not mean an action drawn only from the non-greedy ones. With probability $\epsilon$, the agent picks uniformly at random from *all* actions available in state $s$, i.e. from the whole set $A(s)$. Hence every action, greedy or not, receives the exploration weight $\frac{\epsilon}{|A(s)|}$; it is possible that this random pick coincides with the greedy action.

On top of that, the greedy action receives the remaining probability mass $1 - \epsilon$, so the $\epsilon$-greedy policy is

$$\pi'(a|s) = \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{|A(s)|} & \text{if } a = \operatorname{arg\,max}_{a'} q_{\pi}(s,a'), \\[4pt] \dfrac{\epsilon}{|A(s)|} & \text{otherwise.} \end{cases}$$

Substituting this into Equation 1 and splitting the sum yields Equation 2. Note that the weights do sum to $1$: $|A(s)| \cdot \frac{\epsilon}{|A(s)|} + (1 - \epsilon) = 1$.
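If it helps, here is a minimal numerical check in Python (with made-up $q_{\pi}$ values and an assumed $|A(s)| = 5$) showing that the $\epsilon$-greedy probabilities sum to $1$ and that Equation 1 equals Equation 2:

```python
import numpy as np

rng = np.random.default_rng(0)

eps = 0.1
q = rng.normal(size=5)  # made-up q_pi(s, a) values for |A(s)| = 5 actions
n = len(q)

# epsilon-greedy policy pi'(a|s): every action gets eps / |A(s)|,
# and the greedy action gets an extra (1 - eps)
pi = np.full(n, eps / n)
pi[np.argmax(q)] += 1 - eps

assert np.isclose(pi.sum(), 1.0)  # the probabilities sum to 1

eq1 = np.sum(pi * q)                              # Equation 1
eq2 = (eps / n) * q.sum() + (1 - eps) * q.max()   # Equation 2
assert np.isclose(eq1, eq2)
print(eq1, eq2)
```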

Siong Thye Goh