Consider an MDP and let the Bellman operator be defined as follows,
$$ (T^\pi_\gamma V)(s) = \sum_{a\in A}\pi(a\mid s)\big(r(s,a) + \gamma \sum_{s' \in S} p(s'\mid s,a) V(s')\big) $$
where,
- $\pi:S\to \Delta(A)$ is a policy, i.e., a function mapping each state $s\in S$ to a probability distribution over actions; $\pi(a\mid s)$ denotes the probability of taking action $a\in A$ in state $s$
- $r(s,a)$ is the reward for taking action $a$ in state $s$
- $p(s'\mid s,a)$ is the transition model mapping state-action pairs $(s,a)$ to a distribution over next states $s'$
- $\gamma\in[0,1)$ is a discount factor
Question: Fix a policy $\pi$ that is not necessarily optimal. If we can show that $T^\pi_\gamma$ is a contraction for this fixed policy, does its unique fixed point coincide with the value function $V^\pi$ of that policy? Most of the results I have read only discuss the fixed point of the Bellman optimality operator (i.e., the operator associated with an optimal policy), but I am interested in characterizing the value of various suboptimal policies.
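
For concreteness, here is a minimal numerical sketch of what I mean (Python/NumPy; the MDP, the policy, and all names are made up for illustration). On a small random MDP with a fixed stochastic policy, repeatedly applying $T^\pi_\gamma$ from an arbitrary starting $V$ seems to land on the same vector as directly solving the linear system $V = r^\pi + \gamma P^\pi V$, which is the identity I would like to confirm in general.

```python
import numpy as np

# Illustrative sizes and discount factor (not from any particular problem).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# r[s, a]: reward; p[s, a, s']: transition probabilities;
# pi[s, a]: a fixed, not necessarily optimal, stochastic policy pi(a|s).
r = rng.standard_normal((n_states, n_actions))
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

def bellman_pi(V):
    """Apply T^pi_gamma: (T V)(s) = sum_a pi(a|s) [r(s,a) + gamma * sum_s' p(s'|s,a) V(s')]."""
    q = r + gamma * (p @ V)        # q[s, a]
    return (pi * q).sum(axis=1)

# Fixed-point iteration of T^pi_gamma from an arbitrary starting point.
V = np.zeros(n_states)
for _ in range(1000):
    V = bellman_pi(V)

# Direct solve of the policy-evaluation system V = r^pi + gamma * P^pi V.
r_pi = (pi * r).sum(axis=1)                  # r^pi(s)
P_pi = np.einsum("sa,sat->st", pi, p)        # P^pi(s, s')
V_direct = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Prints a number on the order of machine precision: the two vectors agree.
print(np.max(np.abs(V - V_direct)))
```

Empirically the iteration and the direct solve agree on every random instance I try, but I would like the theoretical statement: that the fixed point guaranteed by the contraction property is exactly $V^\pi$ for this fixed $\pi$.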