
In the book by Richard Sutton and Andrew Barto, "Reinforcement Learning: An Introduction", 2nd edition, there is a proof on page 101, and I don't understand one passage of it.

We want to prove that any $\epsilon$-greedy policy with respect to an action-value function $q_{\pi}$ is an improvement over any $\epsilon$-soft policy $\pi$; this is assured by the policy improvement theorem. Let $\pi'$ be the $\epsilon$-greedy policy. The conditions of the policy improvement theorem apply because for any $s \in \mathcal{S}$:

\begin{array}{ll} q_{\pi}(s, \pi'(s)) &= \sum\limits_a \pi'(a|s) q_{\pi}(s,a) \\ &= \dfrac{\epsilon}{|\mathcal{A}(s)|} \sum\limits_a q_{\pi}(s,a) + (1-\epsilon) \max\limits_a q_{\pi}(s,a) \\ & \ge \dfrac{\epsilon}{|\mathcal{A}(s)|} \sum\limits_a q_{\pi}(s,a) + (1-\epsilon) \sum\limits_a \dfrac{\pi(a|s)-\dfrac{\epsilon}{|\mathcal{A}(s)|}}{1-\epsilon} q_{\pi}(s,a) \end{array}

I did not understand this last passage. The book now has a note which tries to explain this passage:

(the sum is a weighted average with non-negative weights summing to 1, and as such it must be less than or equal to the largest number averaged)

But this note still doesn't help me understand the passage.

This question asks about the passage from line 1 to line 2; I am asking about the passage from line 2 to line 3.


1 Answer


I also spent some time trying to understand this transition. Here is what I came up with; I hope it helps.

$\pi$ is an $\epsilon-$soft policy. This implies that $\pi(a|s) \ge \frac{\epsilon}{|A(s)|}$ for all actions and states.

Let $S = \sum_{a} \frac{\pi(a|s) - \frac{\epsilon}{|A(s)|}}{1-\epsilon} = \frac{\sum_{a}\pi(a|s) - \sum_{a}\frac{\epsilon}{|A(s)|}}{1-\epsilon} = \frac{1 - \frac{|A(s)|\epsilon}{|A(s)|}}{1-\epsilon} = 1 $

Let $w_{a} = \frac{\pi(a|s) - \frac{\epsilon}{|A(s)|}}{1-\epsilon}$ so $S=\sum_{a}w_{a} =1 $

The second part of the equation at line 3 becomes $\sum_{a} \frac{\pi(a|s) - \frac{\epsilon}{|A(s)|}}{1-\epsilon}q_{\pi}(s,a) = \sum_a w_{a}q_{\pi}(s,a)$, which is a weighted average with weights $w_{a}$, where $w_{a} \ge 0$ for all actions since the policy is $\epsilon$-soft.

Since the $w_a$ are non-negative and sum to 1, each term satisfies $w_{a}q_{\pi}(s,a) \le w_{a}\max_{a'}q_{\pi}(s,a')$; summing over $a$ gives

$\sum_{a} \frac{\pi(a|s) - \frac{\epsilon}{|A(s)|}}{1-\epsilon}q_{\pi}(s,a) = \sum_a w_{a}q_{\pi}(s,a) \le \max_{a}q_{\pi}(s,a)$,

i.e. the weighted average "must be less than or equal to the largest number averaged", which is exactly the inequality used in the book.
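If it helps to see the argument numerically, here is a small sketch (my own, not from the book) that builds a random $\epsilon$-soft policy, computes the weights $w_a$, and checks the three claims above: the weights are non-negative, they sum to 1, and the weighted average is bounded by the maximum. The values of `eps` and `n_actions` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
n_actions = 4  # |A(s)|, an arbitrary choice for this sketch

# Build an epsilon-soft policy: every action gets at least eps/|A(s)|,
# and the remaining (1 - eps) probability mass is split randomly.
floor = eps / n_actions
extra = rng.dirichlet(np.ones(n_actions)) * (1 - eps)
pi = floor + extra                 # sums to 1, each entry >= eps/|A(s)|

q = rng.normal(size=n_actions)     # arbitrary action values q_pi(s, .)
w = (pi - floor) / (1 - eps)       # the weights w_a from the answer

assert np.all(w >= 0)              # non-negative weights
assert np.isclose(w.sum(), 1.0)    # S = sum_a w_a = 1
assert w @ q <= q.max() + 1e-12    # weighted average <= largest value
print("sum of weights:", w.sum())
print("weighted average:", w @ q, "<= max:", q.max())
```

Running this with any $\epsilon \in (0, 1)$ and any action values will satisfy all three assertions, which is the whole content of the book's note.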
