
I'm reading Szepesvári's book on RL. My question concerns the proof of Theorem A.10 (p. 71).

Theorem. Let $V$ be the fixed point of $T^*$ and assume that there is a policy $\pi$ which is greedy w.r.t. $V$: $T^\pi V = T^* V$. Then $V = V^*$ and $\pi$ is an optimal policy.

Here the Bellman optimality operator, which is a contraction in the supremum norm, is defined as $$(T^*V)(x)=\sup_{a\in \mathcal{A}}\Bigl\{ r(x,a)+ \gamma \sum_{y\in \mathcal{X}}P(x, a, y) V(y) \Bigr\},\ x\in \mathcal{X},$$

and the optimal value functions are defined as $$V^*(x) = \sup_{a \in \mathcal{A}} Q^*(x, a), \ x\in\mathcal{X},$$ $$Q^*(x, a) = r(x, a) + \gamma\sum_{y \in\mathcal{X}}P(x,a,y) V^*(y), \ x\in\mathcal{X}, a\in\mathcal{A}.$$ Thus $$V^*(x) = \sup_{a \in \mathcal{A}} \Bigl\{ r(x,a) + \gamma\sum_{y\in \mathcal{X}} P(x, a, y) V^*(y) \Bigr\}, \ x \in \mathcal{X}.$$

In the proof of the theorem, Szepesvári states that we cannot know whether $V=V^*$. My question is: why not? Applying $T^*$ to $V^*$ gives $T^*V^*=V^*$, and by Banach's fixed-point theorem the fixed point of $T^*$ is unique, hence $V=V^*$.
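
To make the uniqueness argument concrete, here is a minimal numerical sketch. It assumes a finite MDP (so the supremum over actions is a maximum) and uses a made-up toy example with two states, two actions and $\gamma = 0.9$; the names `bellman_optimality`, `sup_norm`, `iterate_to_fixed_point`, `r` and `P` are hypothetical and not taken from the book. It only illustrates that iterating $T^*$ from two different starting points reaches the same fixed point, and that one application of $T^*$ shrinks sup-norm distances by at least the factor $\gamma$.

```python
GAMMA = 0.9  # discount factor (assumed value for this toy example)


def bellman_optimality(V, r, P, gamma=GAMMA):
    """Apply T* once: (T*V)(x) = max_a { r[x][a] + gamma * sum_y P[x][a][y] * V[y] }."""
    n_states = len(V)
    return [
        max(
            r[x][a] + gamma * sum(P[x][a][y] * V[y] for y in range(n_states))
            for a in range(len(r[x]))
        )
        for x in range(n_states)
    ]


def sup_norm(U, V):
    """Supremum-norm distance between two value functions."""
    return max(abs(u - v) for u, v in zip(U, V))


def iterate_to_fixed_point(V, r, P, tol=1e-10, max_iter=10_000):
    """Repeatedly apply T* until successive iterates differ by less than tol."""
    for _ in range(max_iter):
        V_next = bellman_optimality(V, r, P)
        if sup_norm(V_next, V) < tol:
            return V_next
        V = V_next
    return V


# Hypothetical toy MDP: r[x][a] is the reward, P[x][a][y] the transition probability.
r = [[1.0, 0.0], [0.0, 2.0]]
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]

# Two very different starting points end up at (numerically) the same fixed point.
V_a = iterate_to_fixed_point([0.0, 0.0], r, P)
V_b = iterate_to_fixed_point([100.0, -50.0], r, P)
gap = sup_norm(V_a, V_b)

# Contraction check on an arbitrary pair of value functions.
U0, W0 = [3.0, -7.0], [-1.0, 4.0]
before = sup_norm(U0, W0)
after = sup_norm(bellman_optimality(U0, r, P), bellman_optimality(W0, r, P))

print("gap between limits:", gap)
print("contraction holds:", after <= GAMMA * before + 1e-12)
```

This does not settle the question by itself: it shows uniqueness of the fixed point of $T^*$ under the stated assumptions, not how $V^*$ is defined in the book.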
