
I am reading Reinforcement Learning: An Introduction by Sutton and Barto, and I came across the following derivation:

$$ \begin{align} v_{\pi}(s) &= \mathbb{E}_{\pi}\left[ G_{t} | S_{t} = s \right] \\ &= \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma G_{t+1} | S_{t} = s \right] & (1) \\ &= \sum_{a}\pi(a|s) \sum_{s'}\sum_{r} p(s',r|s,a)\left[r+ \gamma \mathbb{E}_{\pi}[G_{t+1} | S_{t+1}=s']\right] & (2) \\ &= \sum_{a}\pi(a|s) \sum_{s'}\sum_{r} p(s',r|s,a)\left[r+ \gamma v_{\pi}(s') \right]. \end{align} $$

I understand that

$$ \mathbb{E}_{\pi}\left[R_{t+1}|S_{t}=s\right] = \sum_{a}\pi(a|s) \sum_{s'}\sum_{r} p(s',r|s,a) \cdot r $$

but I do not understand how

$$ \begin{align} \mathbb{E}_{\pi}\left[\gamma G_{t+1} | S_{t}=s\right] = \sum_{a}\pi(a|s) \sum_{s'}\sum_{r} p(s',r|s,a) \mathbb{E}_{\pi}\left[\gamma G_{t+1} | S_{t+1}=s'\right] & (3) \end{align} $$

I have read other questions about this, like Deriving Bellman's Equation in Reinforcement Learning, but I don't see any answers that address this step directly. They mention that the law of total expectation comes into play, but I am unable to use it to derive $(3)$.

Can you explain how the authors go from $(1)$ to $(2)$?

EDIT: To add a little more detail on how I tried to convince myself:

$$ \begin{align} \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma G_{t+1} | S_{t} = s \right] &= \mathbb{E}_{\pi}\left[ R_{t+1} | S_{t} = s \right] + \mathbb{E}_{\pi}\left[ \gamma G_{t+1} | S_{t} = s \right] \\ &= \left[\sum_{a}\pi(a|s) \sum_{s'}\sum_{r} p(s',r|s,a) \cdot r\right] + \mathbb{E}_{\pi}\left[ \gamma G_{t+1} | S_{t} = s \right] \\ &= \left[\sum_{a}\pi(a|s) \sum_{s'}\sum_{r} p(s',r|s,a) \cdot r\right] + \mathbb{E}_{\pi}\left[ \gamma \mathbb{E}_{\pi}\left[G_{t+1} | S_{t+1}=s' \right] | S_{t} = s \right] \\ \end{align} $$

and the only way I can see $(2)$ being derived from this is if $(3)$ holds, but I am unable to convince myself that $(3)$ is true.
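For what it's worth, here is a quick Monte Carlo sanity check I put together. The two-state MDP below (its transitions `P`, rewards `R`, and policy `pi`) is just a toy example I made up, with a deterministic reward $r(s,a,s')$ so that the sum over $r$ collapses. Both sides of $(3)$ do agree numerically, but that still doesn't show me why the equality holds:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
S, A = 2, 2

# Made-up toy MDP: P[s, a, s'] transition probabilities,
# R[s, a, s'] deterministic reward, pi[s, a] = pi(a|s).
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.4, 0.6], [0.9, 0.1]]])
R = np.array([[[1.0, 0.0], [0.5, 2.0]],
              [[0.0, 1.5], [1.0, 0.0]]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])

def sample_return(s, horizon=100):
    """One Monte Carlo sample of G_t = sum_k gamma^k R_{t+k+1}, starting in s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(A, p=pi[s])
        s_next = rng.choice(S, p=P[s, a])
        g += discount * R[s, a, s_next]
        discount *= gamma
        s = s_next
    return g

n = 5000
# v_mc[s'] estimates E[G_{t+1} | S_{t+1} = s'] by plain rollouts.
v_mc = np.array([np.mean([sample_return(s) for _ in range(n)]) for s in range(S)])

s = 0
# Left side of (3): take one step from s under pi, then sample the return
# from the next state and discount it.
def one_step_then_return(s):
    a = rng.choice(A, p=pi[s])
    s_next = rng.choice(S, p=P[s, a])
    return gamma * sample_return(s_next)

lhs = np.mean([one_step_then_return(s) for _ in range(n)])

# Right side of (3): sum_a pi(a|s) sum_{s'} p(s'|s,a) * gamma * E[G|S_{t+1}=s'].
rhs = sum(pi[s, a] * P[s, a, s2] * gamma * v_mc[s2]
          for a in range(A) for s2 in range(S))

print(lhs, rhs)  # agree up to Monte Carlo error
```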

A. Wong
  • What's wrong with the answer given by Jie Shi in the thread you refer to? Essentially, it seems to me as if one just writes that out using the same rule for conditional expectations you quote above (and believe), drags along this nasty sum over $g_{t+1}$, and puts everything back together after finishing the first part... – Fabian Werner Jan 06 '19 at 23:19
  • Ok, there is a weird step where he pulls apart $p(s',r,g_{t+1}|a,s)$, which is non-trivial because one needs to know something about the joint density of the random variables $S'=S_{t+1}, R_{t+1}, G_{t+1}, A_t, S=S_t$, and for $G_{t+1}$ it is not even clear that this is a valid expression (does it converge? If so, where? If not, where not?). But since Sutton does only about 1/10 of the math rigorously (i.e. 9/10 is handwaving), you have to take some things on faith or go into the muddy details... – Fabian Werner Jan 06 '19 at 23:24
  • @FabianWerner Yeah, I'm holding my breath and hoping something comes out of the mud – A. Wong Jan 07 '19 at 02:16
  • So, just to understand correctly: your goal is to understand why one can pull apart this probability $p(s',r,g_{t+1}|a,s)$? Or what are you asking for, then? – Fabian Werner Jan 07 '19 at 07:33
  • That's right. How are we deriving $(3)$? – A. Wong Jan 07 '19 at 23:21
  • We are deriving (3) exactly as in the answer of Jie Shi in the thread you are referring to. What exactly is the question? – Fabian Werner Jan 08 '19 at 07:57
  • @FabianWerner $\mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_{t} = s] = \mathbb{E}[R_{t+1} | S_{t} = s] + \mathbb{E}[\gamma G_{t+1} | S_{t} = s] = \sum_{a}\pi(a|s) \sum_{s'}\sum_{r} p(s',r|s,a) \cdot r + (?)$. (?) must be $\sum_{a}\pi(a|s) \sum_{s'}\sum_{r} p(s',r|s,a) \mathbb{E}_{\pi}\left[\gamma G_{t+1} | S_{t+1}=s'\right]$ but why? – A. Wong Jan 09 '19 at 07:04
  • Jie Shi starts at the point you want to start ($\mathbb{E}[R_{t+1}+\gamma G_{t+1}|S_t=s]$) and ends at the point you want to end ($\sum_{a}\pi(a|s) \sum_{s'}\sum_{r} p(s',r|s,a) \mathbb{E}_{\pi}\left[\gamma G_{t+1} | S_{t+1}=s'\right]$), so again: what is the question? Where exactly are you getting lost (row number)? – Fabian Werner Jan 09 '19 at 14:17
  • @FabianWerner I am unable to convince myself of line 2->3 in Jie Shi's answer. I apologize if I wasn't clear on (3) being his line 3 when I mentioned it in a previous comment. – A. Wong Jan 10 '19 at 18:13
  • I added an answer to https://stats.stackexchange.com/questions/243384/deriving-bellmans-equation-in-reinforcement-learning/391113#391113: I do not think that one even knows whether $p(g_{t+1}|s_{t+1}, s_t)$ and all these miraculous terms even exist!!! However, one can get around that by using a limit argument, see my answer... does that help? – Fabian Werner Feb 06 '19 at 15:55

1 Answer


Starting from the answer in Deriving Bellman's Equation in Reinforcement Learning, I think we can get at the problem by rearranging some of the terms.

First, some notation: $G_{t+1}$ is a random variable that can take on values $g \in \Gamma$ (take $\Gamma$ to be countable here, so that plain sums make sense; see the comments above for the caveats about whether these densities even exist). The first line below is then just the definition of conditional expectation:

$$\begin{align} \gamma \mathbb{E}_{\pi}\left[ G_{t+1} | S_t = s \right] & \doteq \gamma \sum_{g \in \Gamma} g p(g|s) & (1)\\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \sum_{g \in \Gamma} g p(g,s',a,r|s) & (2) \\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \sum_{g \in \Gamma} g p(g|s',a,r,s) p(s',a,r|s) & (3) \\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \left( \sum_{g \in \Gamma} g p(g | s') \right) p(s', r | a, s) \pi(a | s) & (4)\\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \mathbb{E}_{\pi}\left[ G_{t+1} | S_{t+1} = s' \right] p(s', r | a, s) \pi(a | s) & (5) \end{align}$$

The manipulations are as follows:

(1) The definition of conditional expectation

(2) "Un-marginalize": write $p(g|s)$ as the marginal over $s'$, $a$, and $r$ of the joint conditional $p(g,s',a,r|s)$

(3) Use the multiplication rule of probability to factor the $p(g,s',a,r|s)$ term from line (2)

(4) Use the Markov property to take $p(g|s',a,r,s) = p(g|s')$, use the multiplication rule again to factor $p(s',a,r|s) = p(s',r|a,s)\,p(a|s)$, and adapt to Sutton and Barto's notation with $\pi(a|s) \doteq p(a|s)$; the sum over $g$ can be pulled inside because the remaining factors do not depend on $g$

(5) Recognize that, again by definition, the term in parentheses on line (4) is $\mathbb{E}_{\pi}\left[ G_{t+1} | S_{t+1} = s' \right]$

Line (5) is exactly the right-hand side of the identity $(3)$ that the question asks about: steps (2)-(5) are the law of total expectation combined with the Markov property.
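As a numerical sanity check of the whole chain, here is a minimal sketch. The toy MDP below (transition tensor `P`, reward table `R`, policy `pi`) is invented for illustration, with a deterministic reward $r(s,a,s')$ so that the sum over $r$ collapses. It computes $v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]$ directly from the definition of the return as a truncated discounted power series, with no Bellman recursion involved, and then confirms that this $v_{\pi}$ satisfies the one-step decomposition from lines (4)-(5):

```python
import numpy as np

gamma = 0.9
# Invented toy MDP: P[s, a, s'] transition probabilities,
# R[s, a, s'] deterministic reward, pi[s, a] = pi(a|s).
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.4, 0.6], [0.9, 0.1]]])
R = np.array([[[1.0, 0.0], [0.5, 2.0]],
              [[0.0, 1.5], [1.0, 0.0]]])
pi = np.array([[0.6, 0.4], [0.3, 0.7]])

# Policy-averaged transition matrix: P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a).
P_pi = np.einsum('sa,sat->st', pi, P)
# One-step expected reward: r_pi[s] = sum_{a, s'} pi(a|s) p(s'|s, a) r(s, a, s').
r_pi = np.einsum('sa,sat,sat->s', pi, P, R)

# v_pi(s) = E[G_t | S_t = s] straight from the definition
# G_t = sum_k gamma^k R_{t+k+1}, as a truncated power series in P_pi.
v_def = np.zeros_like(r_pi)
M = np.eye(len(r_pi))
for _ in range(500):          # gamma^500 is negligible
    v_def += M @ r_pi
    M = gamma * (M @ P_pi)

# Right-hand side of the Bellman equation, i.e. lines (4)-(5):
# sum_a pi(a|s) sum_{s'} p(s'|s, a) [ r(s, a, s') + gamma * v_pi(s') ].
v_bellman = r_pi + gamma * (P_pi @ v_def)

print(np.max(np.abs(v_def - v_bellman)))  # ~0, up to truncation error
```

If the equality in $(3)$ failed, these two vectors would differ.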

Finncent Price