I am quite new to reinforcement learning and find it hard to convince myself that the different notations used for the reward function, state/action value functions, etc. coincide. Apparently I am not the only one, and many people hope for a uniform notation and definitions (see link).
Currently, I am working on TRPO (this paper) and am trying to check whether the notation used there is consistent with the definitions of Sutton and Barto (which I consider the standard reference for RL; correct me if I am wrong).
Firstly, the TRPO paper considers reward functions depending only on the current state, $$ r : \mathcal{S} \to \mathbb{R}. $$ Comparing this with the definition in Sutton and Barto, $$ r(s,a) = \mathbb{E} [R_{t+1} \mid S_t = s, A_t = a], $$ I concluded it must be defined as $$ r(s) = \mathbb{E} [R_{t+1} \mid S_t = s]. $$ The authors of TRPO then define the action and state value functions as \begin{align*} Q_\pi(s_t,a_t) &= \mathbb{E}_{s_{t+1}, a_{t+1}, \dots} \left[ \sum_{l=0}^\infty \gamma^l r(s_{t+l}) \right], \\ V_\pi(s_t) &= \mathbb{E}_{a_t, s_{t+1}, \dots} \left[ \sum_{l=0}^\infty \gamma^l r(s_{t+l}) \right], \end{align*} where $a_t \sim \pi(a_t \mid s_t)$ and $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$ for $t \geq 0$.
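To check my reading of these definitions, here is a minimal numeric sketch on an invented two-state, two-action MDP (all names and numbers are mine, not from the paper). It computes $V_\pi$ exactly by solving the policy-evaluation linear system $V = r + \gamma P_\pi V$, and then $Q_\pi$ from it, using the state-only reward convention above:

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP with a state-only reward r(s),
# matching the TRPO convention r : S -> R. All numbers are made up.
gamma = 0.9
r = np.array([1.0, 0.0])                      # r[s]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],       # P[s, a, s'] = p(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4],                    # pi[s, a] = pi(a | s)
               [0.5, 0.5]])

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) p(s'|s,a)
P_pi = np.einsum('sa,sat->st', pi, P)

# V_pi solves the linear system V = r + gamma * P_pi @ V (policy evaluation).
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r)

# With a state-only reward: Q_pi(s,a) = r(s) + gamma * sum_s' p(s'|s,a) V(s').
Q = r[:, None] + gamma * P @ V

print("V_pi:", V)
print("Q_pi:", Q)
# Sanity check: V_pi(s) = sum_a pi(a|s) Q_pi(s,a), since r does not depend on a.
print("consistent:", np.allclose(V, (pi * Q).sum(axis=1)))
```

If I understand the TRPO definitions correctly, $Q_\pi$ and $V_\pi$ here differ only in whether the expectation is also taken over $a_t$, since $r(s_t)$ itself does not depend on the action.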
Does this definition coincide with the one from Sutton and Barto? \begin{align*} V_\pi(s) &= \mathbb{E}_\pi [G_t \mid S_t = s], \\ Q_\pi(s,a) &= \mathbb{E}_\pi [G_t \mid S_t = s, A_t = a], \end{align*} where $$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots $$ I really prefer this second definition, as it clearly separates random variables from input variables. Nevertheless, I am not quite sure what the notation $$ \mathbb{E}_\pi [G_t \mid S_t = s] $$ means. Is it just shorthand for the expectation over all the remaining random variables of a trajectory generated by following $\pi$, $$ \mathbb{E}_{A_t, S_{t+1}, A_{t+1}, \dots} [G_t \mid S_t = s] = \int \int \dots G_t \, dP_{A_t} \, dP_{S_{t+1}} \, dP_{A_{t+1}} \dots $$ ?
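If that reading is correct, a plain Monte Carlo average of sampled returns should reproduce the exact $V_\pi$ from the sketch above. Here is that check on the same toy MDP (again, everything here is my own invented example, truncating the infinite sum at a horizon `H` where $\gamma^H$ is negligible):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, H, n = 0.9, 100, 5000          # discount, horizon truncation, rollouts

# Same invented toy MDP as in the previous snippet.
r = np.array([1.0, 0.0])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4], [0.5, 0.5]])

def sampled_return(s):
    """One sample of G = sum_l gamma^l r(S_{t+l}) for a trajectory from s,
    with A ~ pi(.|S) and S' ~ p(.|S, A) at every step."""
    G, disc = 0.0, 1.0
    for _ in range(H):
        G += disc * r[s]
        a = rng.choice(2, p=pi[s])     # sample action from the policy
        s = rng.choice(2, p=P[s, a])   # sample next state from the dynamics
        disc *= gamma
    return G

V_mc = np.array([np.mean([sampled_return(s) for _ in range(n)]) for s in (0, 1)])
print("Monte Carlo estimate of V_pi:", V_mc)  # close to the exact V above
```

The estimates do converge to the exact values from the linear solve, which is what makes me suspect the two sets of definitions coincide once one identifies $r(s) = \mathbb{E}[R_{t+1} \mid S_t = s]$, but I would like confirmation that this is the intended meaning of the notation.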