I am trying to understand why the policy iteration algorithm in reinforcement learning always improves the value function until it converges. Let's assume we have a policy $\pi_0(s)$ whose value function $V^{\pi_0}(s)$ satisfies:
$$V^{\pi_0}(s) = R(s,\pi_{0}(s)) + \gamma\sum_{s'}p(s'|s,\pi_{0}(s)) V^{\pi_0}(s')$$
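To make the notation concrete for myself, here is a minimal NumPy sketch of this evaluation step on a random toy MDP of my own; the arrays `R`, `P` and the policy `pi0` are just my stand-ins for $R(s,a)$, $p(s'|s,a)$ and $\pi_0$:

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2

rng = np.random.default_rng(0)
R = rng.uniform(size=(n_states, n_actions))                        # R(s, a)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s' | s, a), sums to 1 over s'

pi0 = np.zeros(n_states, dtype=int)        # some arbitrary deterministic policy pi_0(s)

R_pi0 = R[np.arange(n_states), pi0]        # R(s, pi_0(s))
P_pi0 = P[np.arange(n_states), pi0]        # p(s' | s, pi_0(s)), an n_states x n_states matrix

# Policy evaluation: solve V = R_pi0 + gamma * P_pi0 @ V, i.e. (I - gamma * P_pi0) V = R_pi0
V_pi0 = np.linalg.solve(np.eye(n_states) - gamma * P_pi0, R_pi0)
```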
Then, in the improvement step of policy iteration, we form the greedy policy $\pi_{1}(s)$ with respect to $V^{\pi_0}$:
$$\pi_{1}(s) = \arg\max_{a} \left[ R(s,a) + \gamma\sum_{s'}p(s'|s,a) V^{\pi_0}(s') \right]$$
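In the same toy-MDP sketch (setup repeated so the snippet runs on its own), this improvement step is just a one-step look-ahead followed by an argmax over actions:

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
R = rng.uniform(size=(n_states, n_actions))                        # R(s, a)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s' | s, a)

pi0 = np.zeros(n_states, dtype=int)
V_pi0 = np.linalg.solve(np.eye(n_states) - gamma * P[np.arange(n_states), pi0],
                        R[np.arange(n_states), pi0])

# One-step look-ahead: Q(s, a) = R(s, a) + gamma * sum_s' p(s'|s,a) V_pi0(s')
Q = R + gamma * np.einsum('sat,t->sa', P, V_pi0)
pi1 = Q.argmax(axis=1)                     # greedy policy pi_1(s)
```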
Then we evaluate $\pi_{1}(s)$, i.e. we solve for the value function induced by $\pi_{1}(s)$:
$$V^{\pi_1}(s) = R(s,\pi_{1}(s)) + \gamma\sum_{s'}p(s'|s,\pi_{1}(s)) V^{\pi_1}(s')$$ In order for policy iteration to converge to the optimal value function, the improvement step must be monotone, i.e. we need: $$V^{\pi_1}(s) \geq V^{\pi_0}(s), \forall s$$
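As a sanity check (an illustration only, not a proof), here is a small self-contained sketch that runs one evaluation-improvement-evaluation step of policy iteration on a random toy MDP of my own and tests this inequality elementwise:

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 4, 3
rng = np.random.default_rng(1)
R = rng.uniform(size=(n_states, n_actions))                        # R(s, a)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s' | s, a)

def evaluate(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi for the value of pi."""
    P_pi = P[np.arange(n_states), pi]
    R_pi = R[np.arange(n_states), pi]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

pi0 = rng.integers(n_actions, size=n_states)    # arbitrary starting policy pi_0
V_pi0 = evaluate(pi0)

Q = R + gamma * np.einsum('sat,t->sa', P, V_pi0)
pi1 = Q.argmax(axis=1)                          # greedy improvement w.r.t. V_pi0
V_pi1 = evaluate(pi1)

# The elementwise inequality I am trying to prove (small tolerance for floating point)
print(np.all(V_pi1 >= V_pi0 - 1e-12))
```

This only illustrates the claim on one example; what I am after is a proof that it always holds.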
I have difficulty showing this. What I am doing is the following: after we find $\pi_1(s)$, we have: $$ R(s,\pi_1(s)) + \gamma\sum_{s'}p(s'|s,\pi_1(s)) V^{\pi_0}(s') \geq V^{\pi_0}(s)$$ by the definition of $\pi_1(s)$: the maximum over $a$ is at least the value obtained by choosing $a = \pi_0(s)$, which is exactly $V^{\pi_0}(s)$.
We also have: $$V^{\pi_1}(s) -\gamma\sum_{s'}p(s'|s,\pi_{1}(s)) V^{\pi_1}(s') = R(s,\pi_{1}(s))$$
Combining these, we then have: $$V^{\pi_1}(s) -\gamma\sum_{s'}p(s'|s,\pi_{1}(s)) V^{\pi_1}(s') \geq V^{\pi_0}(s) -\gamma\sum_{s'}p(s'|s,\pi_{1}(s)) V^{\pi_0}(s')$$
I found a similar question and its answer here: "Why does policy iteration algorithm converge to optimal value? (reinforcement learning)". The first answer there arrives at the same inequality, in matrix form: $$\left[I -\gamma P^{\pi_1} \right]V^{\pi_1} \geq \left[I -\gamma P^{\pi_1} \right]V^{\pi_0}$$ and simply states that this implies $V^{\pi_1} \geq V^{\pi_0}$.
I don't see why this is necessarily true: multiplying both sides of an inequality by a matrix inverse such as $(I -\gamma P^{\pi_1})^{-1}$ does not, in general, preserve the direction of the inequality. How do I proceed from here to show that the value function improves?