Questions tagged [policy-iteration]
8 questions
12 votes · 2 answers
Why does the policy iteration algorithm converge to the optimal policy and value function?
I was reading Andrew Ng's lecture notes on reinforcement learning, and I was trying to understand why policy iteration converges to the optimal value function $V^*$ and the optimal policy $\pi^*$.
Recall policy iteration is:
$\text{Initialize } \pi$ …

Charlie Parker · 5,836
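The convergence the question asks about is easiest to see by running the algorithm on a small example. Below is a minimal sketch of tabular policy iteration — exact policy evaluation via a linear solve, then greedy improvement — on an invented two-state MDP (the MDP and function names are illustrative, not from Ng's notes):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration on a finite MDP.

    P[s][a] -> list of (prob, next_state) transitions
    R[s][a] -> expected immediate reward
    """
    n_states = len(P)
    pi = [0] * n_states                      # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        A = np.eye(n_states)
        b = np.zeros(n_states)
        for s in range(n_states):
            b[s] = R[s][pi[s]]
            for p, s2 in P[s][pi[s]]:
                A[s, s2] -= gamma * p
        V = np.linalg.solve(A, b)
        # Policy improvement: act greedily w.r.t. V.
        stable = True
        for s in range(n_states):
            q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                 for a in range(len(P[s]))]
            best = int(np.argmax(q))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:                           # greedy policy unchanged => optimal
            return pi, V

# Invented toy MDP: s0 can stay (reward 0) or move to s1 (reward 1);
# s1 self-loops with reward 2.
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)]]]
R = [[0.0, 1.0], [2.0]]
pi, V = policy_iteration(P, R)   # pi == [1, 0]; V ≈ [19, 20]
```

Each improvement step produces a policy at least as good as the last (the policy improvement theorem), and with finitely many deterministic policies the loop must terminate at a fixed point of the Bellman optimality equation.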
8 votes · 0 answers
Why are value iteration and policy iteration dynamic programming algorithms?
Algorithms like policy iteration and value iteration are often classified as dynamic programming methods that try to solve the Bellman optimality equations.
My current understanding of dynamic programming is this:
It is a method applied to…

Karthik Thiagarajan · 525
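The dynamic-programming flavor the question asks about is visible in value iteration: each sweep of Bellman optimality backups reuses the previous sweep's values, exactly the reuse-of-subproblem-solutions pattern of DP. A minimal sketch on an invented two-state MDP (same conventions as a tabular textbook setup; names are mine):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Value iteration: repeated Bellman optimality backups.

    Each sweep bootstraps on the previous sweep's value estimates --
    the 'overlapping subproblems' reuse that makes this DP.
    """
    n = len(P)
    V = np.zeros(n)
    while True:
        V_new = np.array([
            max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s])))
            for s in range(n)
        ])
        if np.max(np.abs(V_new - V)) < tol:   # geometric convergence
            return V_new
        V = V_new

# Invented toy MDP: s0 can stay (reward 0) or move to s1 (reward 1);
# s1 self-loops with reward 2.
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)]]]
R = [[0.0, 1.0], [2.0]]
V = value_iteration(P, R)   # V ≈ [19, 20]
```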
4 votes · 0 answers
Convergence Proof of First Visit Monte Carlo Control
I am currently trying to find a formal proof of convergence for the Monte Carlo reinforcement learning methods described in Sutton and Barto's book "Reinforcement Learning: An Introduction", Section 5.
They explain that along the ideas of generalized…

GreenLogic · 193
2 votes · 1 answer
Why is $\gamma^t$ needed in REINFORCE: Monte-Carlo Policy-Gradient Control (episodic) for $\pi_{*}$?
While rereading the policy-gradient (PG) chapter in Prof. Sutton's RL book, I noticed the $\gamma^t$ factor in the last line of the pseudocode (as shown below). The book says
The second difference between the pseudocode update and the REINFORCE update
equation (13.8) is that…

GoingMyWay · 1,111
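For reference, that last pseudocode line can be written out directly. Here is a hedged sketch of the episodic REINFORCE update for a tabular softmax policy — the state/episode encoding and names are my own, not the book's; the point is the $\gamma^t$ factor, which weights the update at time $t$ as the discounted objective requires:

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.1, gamma=0.9):
    """One episode of the Sutton & Barto REINFORCE update (episodic).

    episode: list of (S_t, A_t, R_{t+1}) tuples for t = 0..T-1.
    theta[s]: softmax action preferences for state s.
    """
    T = len(episode)
    for t, (s, a, _) in enumerate(episode):
        # G_t = sum_{k=t+1}^{T} gamma^{k-t-1} R_k (reward at index k is R_{k+1})
        G = sum(gamma ** (k - t) * episode[k][2] for k in range(t, T))
        pi = softmax(theta[s])
        grad_log = -pi                 # d/d theta_b of ln pi(a|s) = 1{a=b} - pi_b
        grad_log[a] += 1.0
        # The gamma**t factor is the term the question asks about.
        theta[s] += alpha * (gamma ** t) * G * grad_log
    return theta

theta = {0: np.zeros(2)}
reinforce_update(theta, [(0, 1, 1.0)])   # theta[0] becomes [-0.05, 0.05]
```

Without the $\gamma^t$, the update would correspond to an undiscounted (average-over-states) objective rather than the discounted return from the start state.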
2 votes · 1 answer
Policy improvement in SARSA and Q learning
I have a rather trivial question about SARSA and Q-learning. Looking at the pseudocode of the two algorithms in the Sutton & Barto book, I see that the policy improvement step is missing.
How will I get the optimal policy by the two algorithms? Are they used to find…

Jor_El · 391
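One way to see the answer: in SARSA and Q-learning the improvement step is folded into action selection — acting ($\epsilon$-)greedily with respect to the current Q *is* the improvement — and the final policy is read off greedily from Q after learning. A minimal sketch (function names are mine, not from the book):

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps, rng):
    """Behaviour policy used inside SARSA / Q-learning.

    Selecting (epsilon-)greedy actions w.r.t. the current Q is the
    implicit policy improvement step, so no separate improvement
    line appears in the pseudocode.
    """
    if rng.random() < eps:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[s]))               # exploit = improve

def greedy_policy(Q):
    """Deterministic policy extracted from Q after learning."""
    return {s: int(np.argmax(q)) for s, q in Q.items()}

Q = {0: np.array([0.1, 0.5]), 1: np.array([1.0, -1.0])}
pi = greedy_policy(Q)   # {0: 1, 1: 0}
```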
1 vote · 1 answer
A small confusion about $\epsilon$-greedy policy improvement based on Monte Carlo
I'm working through the RL book by Sutton and Barto. The authors provide a proof based on the policy improvement theorem. I can fully understand the inequality, but the first equality really confuses me: why does $q_{\pi}(s,\pi'(s)) =$ …

FantasticAI · 417
1 vote · 0 answers
How can I increase the total number of iterations it takes policy iteration to converge on an MDP?
I was reading about policy iteration. What factors influence the total number of iterations the algorithm takes to converge?
For a given MDP that converges in 3 iterations, which setting would need to change so that the…

Amanda · 111
1 vote · 1 answer
Q-learning shows worse results than value iteration
I'm trying to solve the same problem (travel the maximum possible distance with a car) with different algorithms. With value iteration and policy iteration I was able to get the best possible results, but Q-learning doesn't seem to go as well.
My…

Most Wanted · 255
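A common cause of this gap is simply insufficient exploration or too few updates: value iteration uses the full model, while Q-learning only sees sampled transitions. On a small deterministic MDP, tabular Q-learning with enough exploratory updates does reach the value-iteration solution. A toy sketch (the two-state MDP is invented, not the questioner's car problem):

```python
import numpy as np

# Invented deterministic two-state MDP: step(s, a) -> (reward, next_state).
# s0 can stay (reward 0) or move to s1 (reward 1); s1 self-loops with reward 2.
def step(s, a):
    if s == 0:
        return (0.0, 0) if a == 0 else (1.0, 1)
    return (2.0, 1)            # state 1: single self-loop action

def q_learning(gamma=0.9, alpha=0.5, n_updates=5000, seed=0):
    rng = np.random.default_rng(seed)
    n_actions = [2, 1]
    Q = [np.zeros(n) for n in n_actions]
    for _ in range(n_updates):
        s = int(rng.integers(2))              # uniform exploring updates
        a = int(rng.integers(n_actions[s]))
        r, s2 = step(s, a)
        # Off-policy TD target: bootstrap on the max over next actions.
        Q[s][a] += alpha * (r + gamma * Q[s2].max() - Q[s][a])
    return Q

Q = q_learning()
# With gamma = 0.9 the optimal values are Q*(0,0) = 17.1, Q*(0,1) = 19,
# Q*(1,0) = 20 -- the same values value iteration produces.
```

With too few updates, too little exploration, or a fixed learning rate in a stochastic environment, the estimates would stop well short of these targets, which is the usual reason Q-learning appears to underperform value iteration.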