
I have what I think is a complex MDP, shown below. Can anyone describe simply how the optimal value $V^*(A)$ of state $A$ is found?

[Figure: the MDP's state-transition diagram for states $A$ and $B$, with actions $a$ and $b$; the transition probabilities and rewards used in the answer below are read off this diagram.]

First Update: for learning purposes, I would really like a canonical, step-by-step solution to this solved question, if there is one.

Second Update: I need a complete solution, i.e. a full account of how this example arrives at its answer.

1 Answer


For action $a$, we have $Q(A, a) = 0.1 V(B) + 0.9 V(A)$. For action $b$, we have $Q(A, b) = 0.8 V(B) + 0.2 V(A)$.

Then, we have $V(A) = -0.1 + \max(Q(A, a), Q(A, b))$. We also have $V(B) = 1$.
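As a numerical sanity check, one can simply iterate this Bellman update until it stops changing. Here is a minimal Python sketch (note this is value iteration rather than the policy-iteration route described next; the transition numbers 0.1/0.9 and 0.8/0.2 are the ones assumed from the diagram):

```python
# Minimal sketch: iterate the Bellman optimality update for V(A)
# until it converges. V(B) = 1 is fixed, as stated above.
V_B = 1.0
V_A = 0.0  # arbitrary initial guess
while True:
    Q_a = 0.1 * V_B + 0.9 * V_A  # action a
    Q_b = 0.8 * V_B + 0.2 * V_A  # action b
    new_V_A = -0.1 + max(Q_a, Q_b)
    if abs(new_V_A - V_A) < 1e-12:
        break
    V_A = new_V_A
print(V_A)  # converges to the optimal V(A)
```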

That is all you need to solve the problem. The easiest way might be to fix the policy: assume $Q(A, a) > Q(A, b)$ and work out what $V(A)$ would be; then do the same for $Q(A, b) > Q(A, a)$, and check which assumption gives a consistent result.

This is called "policy iteration": fix the policy, work out $V^{\pi}(A), Q^{\pi}(A, a), Q^{\pi}(A, b)$, improve the policy if necessary, until convergence. For this particular problem there are only 2 policies we need to consider, so this will lead to a small number of calculations.
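A small Python sketch of that loop, under the same assumptions as above (from $A$, action $a$ reaches $B$ with probability 0.1 and stays with 0.9; action $b$ reaches $B$ with 0.8 and stays with 0.2; each step in $A$ has reward $-0.1$; $V(B)=1$):

```python
# Policy iteration for this two-state MDP. B is absorbing with V(B) = 1,
# so only V(A) needs to be computed.
V_B = 1.0
R_A = -0.1
# P[action] = (prob of moving to B, prob of staying in A)
P = {"a": (0.1, 0.9), "b": (0.8, 0.2)}

def evaluate(action):
    """Solve V(A) = R_A + p_B * V(B) + p_A * V(A) exactly for a fixed policy."""
    p_B, p_A = P[action]
    return (R_A + p_B * V_B) / (1.0 - p_A)

def q_value(action, V_A):
    p_B, p_A = P[action]
    return p_B * V_B + p_A * V_A

policy = "a"  # arbitrary initial policy
while True:
    V_A = evaluate(policy)                          # policy evaluation
    greedy = max(P, key=lambda a: q_value(a, V_A))  # policy improvement
    if greedy == policy:
        break  # policy is stable: done
    policy = greedy

print(policy, V_A)
```

Each evaluation step is just a one-variable linear equation here, and since policy improvement never moves to a worse policy, with only two candidate policies the loop changes the policy at most once.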

Let's first calculate $V^{\pi}(A)$, where $\pi(A)=a$. We get $Q^{\pi}(A,a)=0.1V(B)+0.9V^{\pi}(A)$, $Q^{\pi}(A,b)=0.8V(B)+0.2V^{\pi}(A)$, and $V^{\pi}(A)=-0.1+Q^{\pi}(A,a)$. Fill in $V(B)=1$ and combine the above to get $Q^{\pi}(A,a)=0.1+0.9(-0.1+Q^{\pi}(A,a))=0.01+0.9Q^{\pi}(A,a)$. This gives us $Q^{\pi}(A,a)=0.1$, and hence $V^{\pi}(A)=-0.1+0.1=0$. Now do the same for action $b$: $Q^{\pi}(A,b)=0.8\cdot 1+0.2\cdot 0=0.8$, a higher value; that means we need to change $\pi$. Change $\pi$ so that $\pi(A)=b$, and repeat the calculations.
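Carrying that repeat through explicitly (same numbers as above, so do check my arithmetic): with $\pi(A)=b$ we now have $V^{\pi}(A)=-0.1+Q^{\pi}(A,b)$, hence

$$Q^{\pi}(A,b) = 0.8\,V(B) + 0.2\,V^{\pi}(A) = 0.8 + 0.2\bigl(-0.1 + Q^{\pi}(A,b)\bigr) = 0.78 + 0.2\,Q^{\pi}(A,b),$$

which gives $Q^{\pi}(A,b) = 0.975$ and $V^{\pi}(A) = -0.1 + 0.975 = 0.875$. Checking the other action: $Q^{\pi}(A,a) = 0.1\cdot 1 + 0.9\cdot 0.875 = 0.8875 < 0.975$, so the policy no longer changes. The optimal policy therefore takes $b$ in $A$, and $V^*(A) = 0.875$.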
