
I am currently reading through *Algorithms for Reinforcement Learning*. I think these notes are good, but there are bits that are a bit unclear, and I have a few questions that I think are quite basic:

  1. Definition of the optimal value function:

Quoting the relevant bits of the notes:

The optimal value $V^*(x)$ of state $x$ gives the highest achievable expected return when the process is started from state $x$. The function $V^* : \mathcal{X} \to \mathbb{R}$ is called the optimal value function.

Later, a stochastic stationary policy is defined, and $\Pi_{stat}$ denotes the set of all stationary policies. Quoting the notes again:

The value function $V^{\pi} : \mathcal{X} \to \mathbb{R}$ underlying $\pi$ is defined by $$ V^{\pi}(x) = \mathbb{E} \left[ \left. \sum_{t=0}^\infty \gamma^t R_{t+1} \right| X_0 = x \right] ,\;\; x \in \mathcal{X} $$

Moreover, we have a definition for the *value function underlying an MRP*:

$$ V(x) = \mathbb{E} \left[ \left. \sum_{t=0}^\infty \gamma^t R_{t+1} \right| X_0 = x \right] ,\;\; x \in \mathcal{X} $$

What is the relationship between $V^*(x)$ and $V(x)$? Is it, by any chance,

$$ V^*(x) = \sup_{\pi \in \Pi_{stat}} V^{\pi}(x) \;? $$

I assume the relationship between $Q^{\pi}(x,a)$ and $Q^*(x,a)$ is analogous, namely (assuming my guess above is right)

$$ Q^*(x,a) = \sup_{\pi \in \Pi_{stat}} Q^{\pi}(x,a) $$
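To convince myself that this guess is at least numerically plausible, I put together a minimal sketch in Python (the two-state MDP below, with its transition probabilities and rewards, is entirely made up just for illustration): it computes $V^*$ by value iteration and compares it with the best value obtained by evaluating every deterministic stationary policy exactly.

```python
import numpy as np
from itertools import product

# Hypothetical toy MDP, purely for illustration (numbers are made up):
# 2 states, 2 actions. P[a, x, y] = probability of moving x -> y under action a.
P = np.array([
    [[0.9, 0.1],   # action 0
     [0.2, 0.8]],
    [[0.1, 0.9],   # action 1
     [0.7, 0.3]],
])
r = np.array([     # r[a, x] = expected immediate reward for taking action a in state x
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9
n_actions, n_states, _ = P.shape

# Value iteration: V*(x) = max_a [ r(x,a) + gamma * sum_y P(x,a,y) V*(y) ]
V = np.zeros(n_states)
for _ in range(1000):
    Q = r + gamma * (P @ V)        # Q[a, x]
    V = Q.max(axis=0)

# Brute force: evaluate every deterministic stationary policy exactly,
# V^pi = (I - gamma * P_pi)^{-1} r_pi, and take the best value per state.
best = np.full(n_states, -np.inf)
for policy in product(range(n_actions), repeat=n_states):
    P_pi = np.array([P[policy[x], x] for x in range(n_states)])
    r_pi = np.array([r[policy[x], x] for x in range(n_states)])
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    best = np.maximum(best, V_pi)

print(V)      # V* from value iteration
print(best)   # sup over deterministic stationary policies
```

On this toy example the two printouts should agree (up to the value-iteration tolerance), which is consistent with the guessed $\sup$ definition, but I would still like to see the general argument.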

  2. Relationship between the optimal value function and the optimal action-value function

The optimal value- and action-value functions are connected in the following way: $$ V^*(x) = \sup_a Q^*(x,a) $$ $$ Q^*(x,a) = r(x,a) + \gamma \sum_{y \in \mathcal{X}} \mathcal{P}(x,a,y)V^*(y) $$

I understand the meaning of the first equation, but I don't know where the second comes from. Assuming the definitions of $V^*(x)$ and $Q^*(x,a)$ that I guessed earlier are correct, what is the actual math explaining the two equalities?
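For what it's worth, here is as far as I got when trying to derive the second equation myself, assuming my guessed definitions above are correct; the step I cannot justify is marked with a question mark:

$$
\begin{aligned}
Q^*(x,a) &= \sup_{\pi \in \Pi_{stat}} \mathbb{E} \left[ \left. R_1 + \gamma \sum_{t=1}^{\infty} \gamma^{t-1} R_{t+1} \,\right|\, X_0 = x,\, A_0 = a \right] \\
&= r(x,a) + \gamma \, \sup_{\pi \in \Pi_{stat}} \mathbb{E} \left[ \left. \sum_{t=0}^{\infty} \gamma^{t} R_{t+2} \,\right|\, X_0 = x,\, A_0 = a \right] \\
&\overset{?}{=} r(x,a) + \gamma \sum_{y \in \mathcal{X}} \mathcal{P}(x,a,y) \, \sup_{\pi \in \Pi_{stat}} V^{\pi}(y) \\
&= r(x,a) + \gamma \sum_{y \in \mathcal{X}} \mathcal{P}(x,a,y) \, V^*(y).
\end{aligned}
$$

Is the step where the supremum moves inside the conditioning on the next state the part that needs justification, or am I missing something more basic?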

Note: you can look at section 2.2 of the PDF I attached (the section is short, just in case you need more details).

