
Suppose we are given an MDP and a policy $\pi$, where for simplicity there is exactly one possible start state $s_0$. Let $V^\pi(s)$ denote the expected return when starting in state $s$. Moreover, denote by $d^\pi(s)$ the occupancy frequency, or on-policy distribution (see for instance http://incompleteideas.net/book/bookdraft2017nov5.pdf, page 163). My question is: does the following hold?

$V^\pi(s_0) = \sum_s d^\pi(s) \cdot V^\pi(s)$

where the sum runs over all possible states $s$ (I assume there are only finitely many here). I cannot prove it, and I suspect it does not hold, but I cannot come up with a counterexample either. Do you have a proof or a counterexample? Any help is appreciated!
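For concreteness, here is a quick numerical sketch of the identity on a toy two-state chain (my own made-up numbers, and assuming the normalized discounted occupancy as the definition of $d^\pi$; the undiscounted episodic definition from the book differs in the details):

```python
import numpy as np

# Toy two-state chain under a fixed policy (hypothetical numbers).
# State 0 is the start state and moves to state 1 with reward 1;
# state 1 is absorbing with reward 0.
gamma = 0.9
P = np.array([[0.0, 1.0],    # P[s, s'] under the policy
              [0.0, 1.0]])
r = np.array([1.0, 0.0])     # expected one-step reward in each state

# State values: V = (I - gamma * P)^{-1} r
V = np.linalg.solve(np.eye(2) - gamma * P, r)

# Normalized discounted occupancy from s_0 = 0:
# d(s) = (1 - gamma) * sum_t gamma^t P(S_t = s | S_0 = 0)
d = (1 - gamma) * np.linalg.solve((np.eye(2) - gamma * P).T, np.eye(2)[0])

print(V[0], np.dot(d, V))          # 1.0 vs 0.1 -- the proposed identity fails here
print(np.dot(d, r) / (1 - gamma))  # 1.0 -- the reward-weighted version does match V[0]
```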

  • Is your equation complete, as you intended? You mention $d^{\pi}(s)$ but do not use it? Also could you clarify that this is for an episodic problem? – Neil Slater Oct 14 '19 at 16:13
  • Also, are you using the definition of return: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ ? There are other options, but that is the one commonly used in Sutton & Barto. – Neil Slater Oct 14 '19 at 16:20
  • Yes Neil Slater, I forgot the $d^\pi(s)$, I am so sorry! – NeukirchLover Oct 15 '19 at 18:57

1 Answer


The formula as you originally stated it (i.e. without the $d^\pi(s)$ weights) certainly does not hold. Assume that the rewards are all positive and that there exists a state $s \neq s_0$ such that $V^\pi(s) > 0$; then

$$\text{RHS} = V^\pi(s_0) + V^\pi(s) + \dots \geq V^\pi(s_0) + V^\pi(s) > V^\pi(s_0) = \text{LHS}.$$
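For concreteness, a minimal numerical instance of this (my own toy example, not from the question): take two states $s_0$ and $s_1$, where $s_0$ moves deterministically to $s_1$, $s_1$ is absorbing, every transition gives reward $1$, and $\gamma = 0.9$. Then

$$V^\pi(s_1) = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma} = 10, \qquad V^\pi(s_0) = 1 + \gamma V^\pi(s_1) = 10,$$

so the unweighted sum gives $V^\pi(s_0) + V^\pi(s_1) = 20 > 10 = V^\pi(s_0)$.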

What you might mean is that

$$V^\pi(s) = \sum_{a} \pi(a|s) Q^\pi(s,a)$$

where $Q^\pi(s,a) = E^\pi[R \mid S_0=s, A_0=a]$ and $R = \sum_{t=0}^\infty \gamma^t R_t$ is the return. The identity is easily proved:

Assume that $X:\Omega \to \mathcal{X}$, $Y:\Omega \to \mathcal{Y}$, $Z : \Omega \to \mathcal{Z}$ are random variables with a joint density $p(x,y,z)$, and furthermore assume that $p(y)>0$ for all $y$. Then one can show that $$E[X|Y=y] = \int_{\mathcal{X}} x \cdot p(x|y) dx$$ and $$E[X|Y=y,Z=z] = \int_{\mathcal{X}} x \cdot p(x|y,z) dx$$

(see https://math.stackexchange.com/questions/496608/formal-definition-of-conditional-probability/498338#498338).

Then \begin{align*} E[X|Y=y] &= \int_{\mathcal{X}} x \cdot p(x|y) dx \\ &= \int_{\mathcal{X}} x \cdot \frac{p(x,y)}{p(y)} dx \\ &= \int_{\mathcal{X}} x \cdot \frac{\int_{\mathcal{Z}} p(x,y,z) dz}{p(y)} dx \\ &= \int_{\mathcal{Z}} \int_{\mathcal{X}} x \cdot \frac{ p(x,y,z)p(y,z)}{p(y,z)p(y)} dx dz \\ &= \int_{\mathcal{Z}} \frac{p(z,y)}{p(y)} \int_{\mathcal{X}} x \cdot p(x|y,z) dx dz \\ &= \int_{\mathcal{Z}} p(z|y) E[X|Y=y,Z=z] dz \end{align*}
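This identity can also be checked by direct enumeration in the discrete case (a small sketch with an arbitrary, made-up joint table):

```python
import numpy as np

# Enumeration check of E[X|Y=y] = sum_z p(z|y) E[X|Y=y, Z=z]
# for discrete variables, using an arbitrary joint table normalised to 1.
rng = np.random.default_rng(0)
p = rng.random((3, 2, 2))           # p[x_index, y_index, z_index]
p /= p.sum()
x_vals = np.array([0.0, 1.0, 5.0])  # the values X can take

y = 1  # condition on Y = y
p_y = p[:, y, :].sum()

# Left-hand side: E[X | Y=y] = sum_x x * p(x, y) / p(y)
lhs = np.sum(x_vals * p[:, y, :].sum(axis=1)) / p_y

# Right-hand side: sum_z p(z|y) * E[X | Y=y, Z=z]
rhs = 0.0
for z in range(p.shape[2]):
    p_yz = p[:, y, z].sum()
    e_x_given_yz = np.sum(x_vals * p[:, y, z]) / p_yz
    rhs += (p_yz / p_y) * e_x_given_yz

print(np.isclose(lhs, rhs))  # True
```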

Applied to the situation above, with $X=R$, $Y=S_0$, $Z=A_0$ (and sums in place of integrals, since the action space is discrete), this gives $$V^\pi(s) = E[R \mid S_0=s] = E[X \mid Y=y] = \sum_{z \in \mathcal{A}} p(z|y) E[X \mid Y=y,Z=z] = \sum_{a \in \mathcal{A}} \pi(a|s) Q^\pi(s,a)$$
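Finally, a quick numerical sanity check of $V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)$ on a small made-up MDP (two states, two actions; all transition probabilities and rewards are arbitrary):

```python
import numpy as np

# Verify V^pi(s) = sum_a pi(a|s) Q^pi(s,a) on a small made-up MDP.
gamma = 0.9
n_states = 2
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P[s, a, s']
              [[0.1, 0.9], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],                  # r[s, a]: expected immediate reward
              [0.5, 2.0]])
pi = np.array([[0.4, 0.6],                 # pi[s, a] = pi(a|s)
               [0.9, 0.1]])

# Policy-averaged transition matrix and reward vector
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.sum(pi * r, axis=1)

# V^pi from the Bellman equation: V = r_pi + gamma * P_pi V
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Q^pi(s,a) = r(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
Q = r + gamma * np.einsum('sat,t->sa', P, V)

print(V)                        # state values under pi
print(np.sum(pi * Q, axis=1))   # identical: sum_a pi(a|s) Q(s,a)
```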

Fabian Werner
  • Thank you very much for your fast answer, Fabian! I am so sorry that I forgot to weight each term in the sum with its distribution probability. Otherwise it would be kind of obvious :D Do you still think the equation above is wrong? – NeukirchLover Oct 15 '19 at 18:58
  • Although I have not worked it out explicitly, I am still convinced that I either do not understand the equation you have written down or it is wrong. Reason: the Bellman equation (see https://stats.stackexchange.com/questions/243384/deriving-bellmans-equation-in-reinforcement-learning/391113#391113) tells us that $v(s) = \text{const} + \gamma \sum_{s'} p(s'|s) v(s')$, which does not seem to match the one you want... so let me ask: what is the context? Is the Bellman equation sufficient for your purpose, or do you explicitly need the equation in your form? – Fabian Werner Oct 16 '19 at 11:57