
In the book by Richard Sutton and Andrew Barto, "Reinforcement Learning: An Introduction", 2nd edition, there is a proof on page 101, and I don't understand one passage of it.

We want to prove that any $\epsilon$-greedy policy with respect to an action-value function $q_{\pi}$ is an improvement over any $\epsilon$-soft policy $\pi$; this is assured by the policy improvement theorem. Let $\pi'$ be the $\epsilon$-greedy policy. The conditions of the policy improvement theorem apply because for any $s \in \mathcal{S}$:

\begin{array}{ll} q_{\pi}(s, \pi'(s)) &= \sum\limits_a \pi'(a|s) q_{\pi}(s,a) \\ &= \dfrac{\epsilon}{|\mathcal{A}(s)|} \sum\limits_a q_{\pi}(s,a) + (1-\epsilon) \max\limits_a q_{\pi}(s,a) \\ & \ge \dfrac{\epsilon}{|\mathcal{A}(s)|} \sum\limits_a q_{\pi}(s,a) + (1-\epsilon) \sum\limits_a \dfrac{\pi(a|s)-\dfrac{\epsilon}{|\mathcal{A}(s)|}}{1-\epsilon} q_{\pi}(s,a) \end{array}

I did not understand this last passage. The book now has a note which tries to explain this passage:

(the sum is a weighted average with non-negative weights summing to 1, and as such it must be less than or equal to the largest number averaged)

But this note still doesn't help me understand the passage.

This question asks about the passage from line 1 to line 2; I am asking about the passage from line 2 to line 3.


1 Answer


I also spent some time trying to understand this transition. Here is what I came up with; I hope it helps.

$\pi$ is an $\epsilon-$soft policy. This implies that $\pi(a|s) \ge \frac{\epsilon}{|A(s)|}$ for all actions and states.

Let $S = \sum_{a} \frac{\pi(a|s) - \frac{\epsilon}{|A(s)|}}{1-\epsilon} = \frac{\sum_{a}\pi(a|s) - \sum_{a}\frac{\epsilon}{|A(s)|}}{1-\epsilon} = \frac{1 - \frac{|A(s)|\epsilon}{|A(s)|}}{1-\epsilon} = 1 $

Let $w_{a} = \frac{\pi(a|s) - \frac{\epsilon}{|A(s)|}}{1-\epsilon}$ so $S=\sum_{a}w_{a} =1 $

The second part of the equation at line 3 becomes $\sum_{a} \frac{\pi(a|s) - \frac{\epsilon}{|A(s)|}}{1-\epsilon}q_{\pi}(s,a) = \sum_a w_{a}q_{\pi}(s,a)$, which is a weighted average with weights $w_{a}$, where $w_{a} \ge 0$ for all actions since the policy is $\epsilon$-soft.

Since the $w_a$ are non-negative and sum to 1, each term satisfies $w_{a}q_{\pi}(s,a) \le w_{a}\max_{a'}q_{\pi}(s,a')$; summing over $a$ gives

$\sum_{a} \frac{\pi(a|s) - \frac{\epsilon}{|A(s)|}}{1-\epsilon}q_{\pi}(s,a) = \sum_a w_{a}q_{\pi}(s,a) \le \max_{a}q_{\pi}(s,a)$,

i.e. the weighted average "must be less than or equal to the largest number averaged", which is exactly the inequality used in the book.
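If it helps to see the argument numerically, here is a small sketch (my own, not from the book) that builds a random $\epsilon$-soft policy, computes the weights $w_a$, and checks the three claims above: the weights are non-negative, they sum to 1, and the weighted average is bounded by the maximum. The values of `eps` and `n_actions` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
n_actions = 4  # |A(s)|, an arbitrary choice for this sketch

# Build an epsilon-soft policy: every action gets at least eps/|A(s)|,
# and the remaining (1 - eps) probability mass is split randomly.
floor = eps / n_actions
extra = rng.dirichlet(np.ones(n_actions)) * (1 - eps)
pi = floor + extra                 # sums to 1, each entry >= eps/|A(s)|

q = rng.normal(size=n_actions)     # arbitrary action values q_pi(s, .)
w = (pi - floor) / (1 - eps)       # the weights w_a from the answer

assert np.all(w >= 0)              # non-negative weights
assert np.isclose(w.sum(), 1.0)    # S = sum_a w_a = 1
assert w @ q <= q.max() + 1e-12    # weighted average <= largest value
print("sum of weights:", w.sum())
print("weighted average:", w @ q, "<= max:", q.max())
```

Running this with any $\epsilon \in (0, 1)$ and any action values will satisfy all three assertions, which is the whole content of the book's note.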
