Consider an MDP and let the Bellman operator be defined as follows,
$$ (T^\pi_\gamma V)(s) = \sum_{a\in A}\pi(a\mid s)\big(r(s,a) + \gamma \sum_{s' \in S} p(s'\mid s,a) V(s')\big) $$
where,
- $\pi:S\to \Delta(A)$ is a policy, i.e., a function mapping each state $s\in S$ to a probability distribution over actions; $\pi(a\mid s)$ denotes the probability of taking action $a\in A$ in state $s$
- $r(s,a)$ is the reward for taking action $a$ in state $s$
- $p(s'\mid s,a)$ is the transition model mapping state-action pairs $(s,a)$ to a distribution over next states $s'$
- $\gamma\in[0,1)$ is a discount factor
Question: Fix a policy $\pi$ that is not necessarily optimal. If we can show that $T^\pi_\gamma$ is a contraction for this fixed policy, does its unique fixed point coincide with the value function $V^\pi$ of that policy? Most of the results I have read only discuss the fixed point of the Bellman optimality operator (i.e., the operator associated with an optimal policy), but I am interested in characterizing the value of various suboptimal policies.
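
For concreteness, here is a minimal numerical sketch of what I mean (Python/NumPy; the MDP, the policy, and all names are made up for illustration). On a small random MDP with a fixed stochastic policy, repeatedly applying $T^\pi_\gamma$ from an arbitrary starting $V$ seems to land on the same vector as directly solving the linear system $V = r^\pi + \gamma P^\pi V$, which is the identity I would like to confirm in general.

```python
import numpy as np

# Illustrative sizes and discount factor (not from any particular problem).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# r[s, a]: reward; p[s, a, s']: transition probabilities;
# pi[s, a]: a fixed, not necessarily optimal, stochastic policy pi(a|s).
r = rng.standard_normal((n_states, n_actions))
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

def bellman_pi(V):
    """Apply T^pi_gamma: (T V)(s) = sum_a pi(a|s) [r(s,a) + gamma * sum_s' p(s'|s,a) V(s')]."""
    q = r + gamma * (p @ V)        # q[s, a]
    return (pi * q).sum(axis=1)

# Fixed-point iteration of T^pi_gamma from an arbitrary starting point.
V = np.zeros(n_states)
for _ in range(1000):
    V = bellman_pi(V)

# Direct solve of the policy-evaluation system V = r^pi + gamma * P^pi V.
r_pi = (pi * r).sum(axis=1)                  # r^pi(s)
P_pi = np.einsum("sa,sat->st", pi, p)        # P^pi(s, s')
V_direct = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Prints a number on the order of machine precision: the two vectors agree.
print(np.max(np.abs(V - V_direct)))
```

Empirically the iteration and the direct solve agree on every random instance I try, but I would like the theoretical statement: that the fixed point guaranteed by the contraction property is exactly $V^\pi$ for this fixed $\pi$.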