
I am trying to understand why the policy iteration algorithm in reinforcement learning always improves the value function until it converges. Let's assume we have a policy $\pi_0(s)$ whose value function $V^{\pi_0}(s)$ satisfies:

$$V^{\pi_0}(s) = R(s,\pi_{0}(s)) + \gamma\sum_{s'}p(s'|s,\pi_{0}(s)) V^{\pi_0}(s')$$

Then, in the policy improvement step, we find the next policy $\pi_{1}(s)$, which is greedy with respect to $V^{\pi_0}$:

$$\pi_{1}(s) = \arg\max_{a} \left[ R(s,a) + \gamma\sum_{s'}p(s'|s,a) V^{\pi_0}(s') \right]$$
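
For concreteness, here is a minimal numpy sketch of this improvement step. The tabular arrays `R`, `P`, `V` and their shapes are my own assumed setup, not part of the question:

```python
import numpy as np

def improve_policy(R, P, V, gamma):
    """Greedy improvement: pi_1(s) = argmax_a [ R(s,a) + gamma * sum_s' p(s'|s,a) V(s') ].

    Assumed shapes: R is (S, A) rewards, P is (S, A, S) transition
    probabilities, V is (S,) the value vector of the current policy.
    """
    Q = R + gamma * P @ V        # Q[s, a] = R(s, a) + gamma * sum_s' P[s, a, s'] * V[s']
    return np.argmax(Q, axis=1)  # one greedy action index per state
```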

We then evaluate $\pi_{1}(s)$, computing the value function it induces:

$$V^{\pi_1}(s) = R(s,\pi_{1}(s)) + \gamma\sum_{s'}p(s'|s,\pi_{1}(s)) V^{\pi_1}(s')$$

For policy iteration to converge to the optimal value function, we need:

$$V^{\pi_1}(s) \geq V^{\pi_0}(s), \quad \forall s$$
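
Under the same assumed setup, the evaluation step can be done exactly by solving the linear Bellman system instead of iterating to convergence (again a sketch, not a reference implementation):

```python
import numpy as np

def evaluate_policy(R, P, pi, gamma):
    """Exact evaluation: solve V = R_pi + gamma * P_pi V, i.e. (I - gamma * P_pi) V = R_pi."""
    S = R.shape[0]
    R_pi = R[np.arange(S), pi]   # R(s, pi(s)) for each state s
    P_pi = P[np.arange(S), pi]   # p(s'|s, pi(s)), shape (S, S)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```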

I have difficulty showing this. Here is what I have so far. After we find $\pi_1(s)$, by the definition of $\pi_1(s)$ we have: $$ R(s,\pi_1(s)) + \gamma\sum_{s'}p(s'|s,\pi_1(s)) V^{\pi_0}(s') \geq V^{\pi_0}(s)$$ (the maximizing action is at least as good as the action $\pi_0(s)$, whose backup gives exactly $V^{\pi_0}(s)$).

We also have, by rearranging the Bellman equation for $\pi_1$: $$V^{\pi_1}(s) -\gamma\sum_{s'}p(s'|s,\pi_{1}(s)) V^{\pi_1}(s') = R(s,\pi_{1}(s))$$

Combining these, we then have: $$V^{\pi_1}(s) -\gamma\sum_{s'}p(s'|s,\pi_{1}(s)) V^{\pi_1}(s') \geq V^{\pi_0}(s) -\gamma\sum_{s'}p(s'|s,\pi_{1}(s)) V^{\pi_0}(s')$$

I found a similar question and its answer here: "Why does policy iteration algorithm converge to optimal value? (reinforcement learning)". The first answer there reaches the same inequality, in matrix form: $$\left[I -\gamma P^{\pi_1} \right]V^{\pi_1} \geq \left[I -\gamma P^{\pi_1} \right]V^{\pi_0}$$ and simply states that this leads to $V^{\pi_1} \geq V^{\pi_0}$.

I don't see why this is necessarily true. Multiplying both sides by the matrix inverse $(I -\gamma P^{\pi_1})^{-1}$ is not, in general, guaranteed to preserve the direction of the inequality. How can I proceed from here and show that the value function improves?


1 Answer

I think I have found an answer to my question.

$P^{\pi_1}$ is a stochastic matrix: its rows sum to 1 and all of its entries are nonnegative. It can be shown that the eigenvalue of a stochastic matrix with the largest absolute value is $1$ (a proof is here: https://math.stackexchange.com/questions/40320/proof-that-the-largest-eigenvalue-of-a-stochastic-matrix-is-1 ; the Perron-Frobenius and Gershgorin circle theorems are also related).

Hence the spectral radius of $\gamma P^{\pi_1}$ is $\gamma$. Now consider $(I - \gamma P^{\pi_1})$. It is easy to see that it is a Z-matrix, i.e., a matrix whose off-diagonal entries are all non-positive. Moreover, it is an M-matrix: a matrix of the form $kI-B$, where $B$ is a non-negative matrix whose spectral radius is strictly less than $k$. Indeed, $(I - \gamma P^{\pi_1})$ has the form $kI-B$ with $k=1$ and $B=\gamma P^{\pi_{1}}$; the spectral radius of $\gamma P^{\pi_{1}}$ is $\gamma$, and since $\gamma$ is the discount factor we have $\gamma < 1$. Therefore $(I - \gamma P^{\pi_1})$ is an M-matrix. Non-singular M-matrices have the special property that their inverses are entrywise non-negative, so $(I - \gamma P^{\pi_1})^{-1}_{ij}\geq 0$ for all $1 \leq i,j \leq n$.
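
Both claims are easy to check numerically. A minimal sketch with numpy (the matrix size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# A random stochastic matrix: nonnegative entries, each row summing to 1
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

# Spectral radius of a stochastic matrix is 1, so rho(gamma * P) = gamma
print(np.max(np.abs(np.linalg.eigvals(P))))  # ~1.0

# The M-matrix (I - gamma * P) has an entrywise nonnegative inverse
inv = np.linalg.inv(np.eye(n) - gamma * P)
print(np.all(inv >= 0))                      # True
```

Equivalently, since the spectral radius of $\gamma P^{\pi_1}$ is $\gamma < 1$, the inverse expands as the Neumann series $(I - \gamma P^{\pi_1})^{-1} = \sum_{k \geq 0} (\gamma P^{\pi_1})^k$, a convergent sum of non-negative matrices, which gives the non-negativity directly.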

We had $(I - \gamma P^{\pi_1})V^{\pi_1} \geq (I - \gamma P^{\pi_1})V^{\pi_0}$. Left-multiplying a componentwise inequality by an entrywise non-negative matrix preserves it, so multiplying both sides by $(I - \gamma P^{\pi_1})^{-1}$ gives:

$$(I - \gamma P^{\pi_1})^{-1}(I - \gamma P^{\pi_1})V^{\pi_1} \geq (I - \gamma P^{\pi_1})^{-1}(I - \gamma P^{\pi_1})V^{\pi_0}$$ and $$V^{\pi_1} \geq V^{\pi_0}$$

which concludes the proof.
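
As a sanity check, here is a small end-to-end experiment on a random MDP (a sketch under the same assumed tabular setup; all names and sizes are hypothetical). One greedy improvement step never decreases the value at any state:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9
R = rng.random((S, A))                       # R[s, a]
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)            # make each p(.|s, a) a distribution

idx = np.arange(S)
pi0 = rng.integers(A, size=S)                # arbitrary starting policy
V0 = np.linalg.solve(np.eye(S) - gamma * P[idx, pi0], R[idx, pi0])

pi1 = np.argmax(R + gamma * P @ V0, axis=1)  # greedy improvement step
V1 = np.linalg.solve(np.eye(S) - gamma * P[idx, pi1], R[idx, pi1])

print(np.all(V1 >= V0 - 1e-12))              # True: componentwise improvement
```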
