
In Sutton and Barto's RL book, the authors prove that any $\epsilon$-greedy policy with respect to $q_{\pi}$ is an improvement over any $\epsilon$-soft policy $\pi$; this is assured by the policy improvement theorem. Let $\pi'$ be the $\epsilon$-greedy policy. In this derivation, I couldn't understand how the authors went from Equation 1 to Equation 2.

Equation 1: $q_{\pi}(s,\pi'(s)) = \sum_{a}\pi'(a|s)\,q_{\pi}(s,a)$

Equation 2: $q_{\pi}(s,\pi'(s)) = \frac{\epsilon}{|A(s)|}\sum_{a} q_{\pi}(s,a) + (1 - \epsilon)\max_{a} q_{\pi}(s,a)$

As far as I understand, we choose a non-greedy action with probability $\epsilon$ and the greedy action with probability $1 - \epsilon$. But then how did we end up with $\frac{\epsilon}{|A(s)|}$ as the weight for non-greedy actions? Shouldn't it be $\frac{\epsilon}{\text{number of non-greedy actions}}$, so that the weights sum to $1$, since they are probabilities after all?

Am I missing something here? Please help me out; I am a beginner in RL. Thanks.

adithya

1 Answer


By a "non-greedy action", they do not mean an action drawn only from the non-greedy ones. With probability $\epsilon$, the agent picks uniformly at random from *all* actions available in state $s$, i.e. from the whole set $A(s)$. Hence every action, greedy or not, receives the exploration weight $\frac{\epsilon}{|A(s)|}$; it is possible that this random pick coincides with the greedy action.

On top of that, the greedy action receives the remaining probability mass $1 - \epsilon$, so the $\epsilon$-greedy policy is

$$\pi'(a|s) = \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{|A(s)|} & \text{if } a = \operatorname{arg\,max}_{a'} q_{\pi}(s,a'), \\[4pt] \dfrac{\epsilon}{|A(s)|} & \text{otherwise.} \end{cases}$$

Substituting this into Equation 1 and splitting the sum yields Equation 2. Note that the weights do sum to $1$: $|A(s)| \cdot \frac{\epsilon}{|A(s)|} + (1 - \epsilon) = 1$.
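If it helps, here is a minimal numerical check in Python (with made-up $q_{\pi}$ values and an assumed $|A(s)| = 5$) showing that the $\epsilon$-greedy probabilities sum to $1$ and that Equation 1 equals Equation 2:

```python
import numpy as np

rng = np.random.default_rng(0)

eps = 0.1
q = rng.normal(size=5)  # made-up q_pi(s, a) values for |A(s)| = 5 actions
n = len(q)

# epsilon-greedy policy pi'(a|s): every action gets eps / |A(s)|,
# and the greedy action gets an extra (1 - eps)
pi = np.full(n, eps / n)
pi[np.argmax(q)] += 1 - eps

assert np.isclose(pi.sum(), 1.0)  # the probabilities sum to 1

eq1 = np.sum(pi * q)                              # Equation 1
eq2 = (eps / n) * q.sum() + (1 - eps) * q.max()   # Equation 2
assert np.isclose(eq1, eq2)
print(eq1, eq2)
```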

Siong Thye Goh