Misunderstanding of E-Greedy Monte Carlo Proof

Question

I'm confused about one step of the e-greedy Monte Carlo control proof on page 83 of Sutton and Barto Reinforcement Learning.

The book annotates this step with: "(the sum is a weighted average with nonnegative weights summing to 1, and as such it must be less than or equal to the largest number averaged)"

I understand that this substitution into the summation is essentially a max in the context of this problem with some properties allowing it to simplify nicely.

My question is: why does this substitution make the right side of the equation less than or equal to the state-action value $q(s,a)$?

My thinking was that, because this weight acts like a max and the value function is following the same policy, wouldn't this continue to be equal to $q(s,a)$?

I've included the full proof from Sutton and Barto and the algorithm below. Thanks!

Neil Slater · Accepted Answer · 2018-06-02T11:23:46.140

The proof shows that if you have measured or calculated $q_{\pi}(s,a)$ under $\pi$ and then set $\pi'$ to be $\epsilon$-greedy with respect to $q_{\pi}(s,a)$, then $\pi'$ is at least as good as $\pi$.

The values are equal only when the greedy choice under $\pi'$ is the same as the choice under $\pi$.
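To make the book's annotation concrete, here is a sketch of why the sum is a weighted average. The shorthand $w(a)$ is introduced here for illustration and is not the book's notation, $|\mathcal{A}(s)|$ denotes the number of actions available in $s$, and the only assumption is that $\pi$ is $\epsilon$-soft, i.e. $\pi(a|s) \ge \epsilon/|\mathcal{A}(s)|$ for every action:

$$w(a) = \frac{\pi(a|s) - \frac{\epsilon}{|\mathcal{A}(s)|}}{1-\epsilon} \ge 0, \qquad \sum_a w(a) = \frac{1 - \epsilon}{1-\epsilon} = 1$$

$$\sum_a w(a)\, q_{\pi}(s,a) \le \max_a q_{\pi}(s,a)$$

Multiplying both sides by $(1-\epsilon)$ gives the inequality used in the proof. Equality requires all the weight to sit on maximising actions.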

A worked example might help:

Say for a specific $s$ there are four actions $[a_0, a_1, a_2, a_3]$. Under an initial $\epsilon$-greedy policy $\pi$ the preferred action choice is $a_3$, so the action probabilities are $[\epsilon/4, \epsilon/4, \epsilon/4, 1 - \epsilon + \epsilon/4]$. The action-values for this policy, $q_{\pi}(s,[a_0, a_1, a_2, a_3])$, are then measured as $[5, 2, 3, 4]$.

We can compare these parts of the proof:
- Part of (5.2), substituting 5 for $q_{\pi}(s,a_0)$: $$(1-\epsilon)\max_a q_{\pi}(s,a) = 5(1-\epsilon)$$
- Corresponding part of the next line, substituting 4 for $q_{\pi}(s,a_3)$ and $1-\epsilon + \epsilon/4$ for $\pi(a_3|s)$: $$(1-\epsilon)\sum_a \frac{\pi(a|s)-\frac{\epsilon}{4}}{1 - \epsilon} q_{\pi}(s,a) = 4(1-\epsilon)$$

You may need to work through the second one to see how the non-greedy actions resolve to $0$ and the factor of $1 - \epsilon$ cancels out: each non-greedy action has $\pi(a|s) = \epsilon/4$, so its weight is $0$, while $a_3$ has weight $\frac{(1 - \epsilon + \epsilon/4) - \epsilon/4}{1 - \epsilon} = 1$. Here in the example, the second term is indeed less than the first one, and hopefully that addresses your concern that they look like they would always be equal. You can repeat the process for a case where the preferred action $a_3$ is already the maximising action, and the two terms will then be equal.
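The arithmetic in this example can be checked numerically. The following is a minimal sketch in plain Python; the value `epsilon = 0.1` and the variable names are assumptions chosen for illustration, and any $0 < \epsilon < 1$ should give the same ordering.

```python
epsilon = 0.1  # illustrative value (an assumption); any 0 < epsilon < 1 works
q = [5.0, 2.0, 3.0, 4.0]  # q_pi(s, a) for a_0..a_3, taken from the example
n = len(q)

# epsilon-greedy policy pi whose preferred action is a_3
pi = [epsilon / n] * n
pi[3] += 1 - epsilon

# weights used in the proof: (pi(a|s) - epsilon/n) / (1 - epsilon)
weights = [(p - epsilon / n) / (1 - epsilon) for p in pi]

# the two terms being compared
max_term = (1 - epsilon) * max(q)
avg_term = (1 - epsilon) * sum(w * x for w, x in zip(weights, q))

print("weights:", weights)
print("max term:", max_term)
print("weighted-average term:", avg_term)
```

With these numbers the max term is approximately $4.5$ and the weighted-average term approximately $3.6$, matching $5(1-\epsilon)$ and $4(1-\epsilon)$.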

I suspect it is the strict sequence of define/measure/refine for the policy improvement that is throwing you off. You may think that $\pi$ is always defined as $\arg\max_a q(s,a)$, as that's a common way of treating it in the pseudocode. However, in this case we're trying to prove that this approach works, so we are looking separately at whether changing the policy to follow the max q values is justified theoretically. Keeping track of $\pi$, $\pi'$ and the subscript on $q_{\pi}$ is therefore critical to following the proof.

This is a specific example with specific values, but in no way is a general proof of the equation posted by the OP — robertspierre, Jul 15 '19 at 23:42
The problem I have is that $\pi$ is not defined as an initial $\epsilon$-greedy policy, but more broadly as an initial $\epsilon$-soft policy, so it is not true that "You may need to work through the second one to see how the non-greedy actions resolve to 0 and the factor of $1-\epsilon$ cancels out". In fact, under $\pi$, the probabilities of the non-greedy actions are not constrained to equal $\epsilon/|A|$, but can be greater than that — robertspierre, Jul 15 '19 at 23:48
@raffamaiden: It is not intended to be a general proof, and OP did not ask for one. The worked proof is in Sutton & Barto, this is just clarifying a single step of that proof. Probably the results in S&B can be extended to general $\epsilon$-soft provided it is replaced by $\epsilon$-greedy on the next iteration, but yes then the factor will not cancel out. I suggest if you are looking for a more rigorous proof showing this that you ask a new question on the site. I think it would be of interest here — Neil Slater, Jul 16 '19 at 06:45
Why does $q_{\pi}(s, \pi') = \sum_a \pi'(a|s)\,q_{\pi}(s,a)$? — FantasticAI, Oct 16 '20 at 17:03
@keqiaoli I don't state that in this answer, but it is in the original proof. There is a small liberty with the notation, so it might just not be clear what they are trying to express with $q_{\pi}(s, \pi'(s))$. It basically follows from definition of $q_{\pi}$ . Not enough room for an answer in comments, so perhaps ask a new question on the site? — Neil Slater, Oct 16 '20 at 18:05
I have posted a new question here, feel free to check it out. https://stats.stackexchange.com/questions/492781/one-small-confusion-on-e-greedy-policy-improvement-based-on-monte-carlo — FantasticAI, Oct 20 '20 at 01:44

score 1 · Answer 2 · answered Nov 08 '21 at 23:04

Basically, when $\pi$ is itself $\epsilon$-greedy, the weighted average assigns weight one to the greedy action w.r.t. $\pi$ and weight zero to the non-greedy actions. The inequality can then be read as "The new greedy action w.r.t. $\pi'$ gives AT LEAST the same action value as the greedy action w.r.t. $\pi$".

Juan Orozco