Questions tagged [sarsa]
14 questions
2 votes · 1 answer
Policy improvement in SARSA and Q-learning
I have a rather trivial doubt about SARSA and Q-learning. Looking at the pseudocode of the two algorithms in the Sutton & Barto book, I see that the policy improvement step is missing.
How do I get the optimal policy from the two algorithms? Are they used to find…

Jor_El · 391
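For context: neither algorithm needs an explicit improvement step, because acting ε-greedily with respect to the current Q is itself a (soft) policy improvement, and the final policy is read off greedily from the learned Q. A minimal tabular sketch in Python (all names hypothetical):

```python
import numpy as np

# Hypothetical tabular setup: Q[s, a] as learned by SARSA or Q-learning.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(Q, s, epsilon=0.1):
    """Acting epsilon-greedily w.r.t. the current Q is the implicit
    (soft) policy improvement step; no separate step is needed."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])   # explore
    return int(np.argmax(Q[s]))                # exploit current estimate

# After learning, the (estimated) optimal policy is read off greedily:
greedy_policy = Q.argmax(axis=1)
```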
2 votes · 1 answer
SARSA when the policy is not epsilon-greedy
I would like to clarify a doubt that I have regarding SARSA. SARSA can be used for optimal control when the policy to take action $a$ is epsilon-greedy. Suppose that the policy to take action $a$ is not an epsilon-greedy one, but some other policy,…

calveeen · 746
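For reference, the SARSA update itself makes no assumption that the behavior policy is ε-greedy; $A_{t+1}$ is simply drawn from whatever policy $\pi$ is being followed:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\bigl[R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\bigr], \qquad A_{t+1} \sim \pi(\cdot \mid S_{t+1}).$$

What ε-greedy action selection (with ε decaying, the GLIE condition) adds is convergence toward the optimal policy rather than mere evaluation of the fixed policy $\pi$.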
2 votes · 1 answer
Can Q-learning or SARSA be used to find a stochastic policy?
If the optimal policy is known to be stochastic (e.g., as in the rock-paper-scissors game), can this stochastic policy be found using SARSA or Q-learning, or is it only possible with policy gradient approaches?

aorj · 33
1 vote · 0 answers
Can ε-greedy Q-learning / SARSA have a stochastic policy?
Hello, I'm currently studying Q-learning and SARSA with ε-greedy and softmax strategies, and I have a question about my readings.
According to my readings, when SARSA is used with ε-greedy, it causes value-function oscillations in the case of stochastic policies. But I think…

BE LEO · 11
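Worth noting next to this question: an ε-greedy policy is itself stochastic by construction. A minimal sketch (hypothetical names) of the action distribution it induces:

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon=0.1):
    """Action distribution induced by epsilon-greedy: every action gets
    epsilon/|A|, and the greedy action gets the remaining 1 - epsilon."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

print(epsilon_greedy_probs(np.array([0.5, 2.0, 1.0])))
# -> [0.0333... 0.9333... 0.0333...]
```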
1 vote · 1 answer
In the SARSA and Q-learning algorithms in RL, is the policy updated during the iteration for Q-value learning?
In the video by Prof. Brunskill, "Stanford CS234 winter 2019 lecture 4" on model-free control (https://www.youtube.com/watch?v=j080VBVGkfQ), at 57:49/1:17:45, the pseudocode for SARSA includes line 8 for the ε-greedy update of the current policy π. It…

Ruye · 11
1 vote · 0 answers
Expected SARSA, SARSA and Q-learning
I would much appreciate it if you could point me in the right direction regarding this question about the targets for the approximate q-function in SARSA, Expected SARSA, and Q-learning (notation: S is the current state, A is the current action, R is the reward,…

Novak · 111
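Using this question's notation, with weights $\mathbf{w}$ and next state $S'$, the three one-step targets for the approximate q-function $\hat q$ differ only in how the successor action is summarized:

$$
\begin{aligned}
\text{SARSA:} \quad & R + \gamma\, \hat q(S', A', \mathbf{w}) \\
\text{Expected SARSA:} \quad & R + \gamma \sum_{a} \pi(a \mid S')\, \hat q(S', a, \mathbf{w}) \\
\text{Q-learning:} \quad & R + \gamma \max_{a} \hat q(S', a, \mathbf{w})
\end{aligned}
$$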
1 vote · 0 answers
How sensitive is reinforcement learning to the neural network structure?
I am trying out deep SARSA reinforcement learning on the OpenAI Gym CartPole-v0 problem. The state has 4 continuous features and the action is binary, either 0 or 1. The state-action vector is then fed to a neural network to output the state-action…

Le Hoang Long · 11
1 vote · 1 answer
Differences between Sarsa and Q-learning control procedural algorithms
I am referring to pages 130-131 of the Sutton and Barto book on Reinforcement Learning, available here: book
I don't understand the slight difference between the two procedural algorithms described respectively at page 130 for Sarsa and at…

hardhu · 133
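The difference the question asks about sits in a single line of each algorithm; a minimal tabular sketch (hypothetical names, terminal-state handling omitted) contrasting the two updates:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstraps from the action a_next that will actually be
    executed, so a_next must be selected before this update runs."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstraps from the greedy action, regardless of which
    action the behavior policy goes on to execute."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

The procedural consequence is the ordering in the two pseudocodes: Sarsa must choose $A'$ before it can update, whereas Q-learning updates first and chooses the next action afterwards.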
1 vote · 0 answers
Convergence criterion for R-learning algorithm
I'm trying to find a policy for a simple game using the R-learning algorithm. I have a field with values (the agent can move in 4 directions) and the goal is to get from the starting point to the finish point with the highest score.
The final policy gives me…

Most Wanted · 255
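For reference, R-learning (Schwartz, 1993) is the average-reward relative of Q-learning, and a common convergence check is that both the Q-table and the average-reward estimate $\rho$ stop changing appreciably. A sketch of the standard updates (where $\beta$ is the step size for $\rho$, and the $\rho$ update is applied only when a greedy action was taken):

$$Q(s,a) \leftarrow Q(s,a) + \alpha\bigl[r - \rho + \max_{a'} Q(s',a') - Q(s,a)\bigr]$$
$$\rho \leftarrow \rho + \beta\bigl[r - \rho + \max_{a'} Q(s',a') - \max_{a} Q(s,a)\bigr]$$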
0 votes · 0 answers
Building a simulator for continuous state, discrete action reinforcement learning
I am trying to build a simulator that optimizes the performance and temperature of a device. I want the device to perform well, but without making the device too hot. If the device becomes too hot, I want the internal circuitry to push down the…
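One common way to frame such a device is a gym-style environment with a continuous state (performance, temperature) and a small discrete action set. A minimal sketch under assumed names and a made-up thermal model (everything below is hypothetical, not the asker's device):

```python
import numpy as np

class DeviceEnv:
    """Hypothetical simulator: continuous state = (performance, temperature),
    discrete actions = {0: clock down, 1: hold, 2: clock up}."""

    def reset(self):
        self.perf, self.temp = 0.5, 40.0
        return np.array([self.perf, self.temp])

    def step(self, action):
        self.perf = float(np.clip(self.perf + 0.1 * (action - 1), 0.0, 1.0))
        self.temp += 5.0 * self.perf - 2.0               # crude heat model
        reward = self.perf - max(0.0, self.temp - 80.0)  # penalize overheating
        done = self.temp > 100.0                         # thermal shutdown
        return np.array([self.perf, self.temp]), reward, done, {}
```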
0 votes · 1 answer
Deduce the Bellman equation from the Value and Q functions
I am trying to derive/deduce the Bellman equation using the value and Q-functions.
I have only got so far in understanding it, and tried it myself in LaTeX:
Why is $V^*$ suddenly in the $Q^\pi$ function? Why not $Q^\pi = r + \gamma Q^\pi(s_{t+1},…

johnny_1010 · 1
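The step this question usually stumbles on is the identity connecting $V$ and $Q$. For a fixed policy $\pi$,

$$Q^\pi(s_t, a_t) = \mathbb{E}\bigl[r_t + \gamma\, V^\pi(s_{t+1})\bigr], \qquad V^\pi(s) = \mathbb{E}_{a \sim \pi}\bigl[Q^\pi(s, a)\bigr],$$

while for the optimal functions $V^*(s) = \max_a Q^*(s, a)$, so

$$Q^*(s_t, a_t) = \mathbb{E}\bigl[r_t + \gamma \max_{a'} Q^*(s_{t+1}, a')\bigr] = \mathbb{E}\bigl[r_t + \gamma\, V^*(s_{t+1})\bigr].$$

One cannot write simply $r + \gamma Q(s_{t+1}, \cdot)$ because $Q$ needs a next action as an argument; $V$ (equivalently, the expectation or max over next actions) is what summarizes it.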
0 votes · 1 answer
Purpose of trace-decay parameter in eligibility traces
In TD(λ)/SARSA(λ), eligibility traces are decayed after each step by multiplying by the discount rate and the trace-decay parameter.
I understand that:
The discount rate is used to reduce the value of future actions relative to a state.
An…

Levi Botelho · 103
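A minimal sketch of one step of tabular SARSA(λ) with accumulating traces (hypothetical names), showing that the discount rate and the trace-decay parameter are two distinct knobs that only meet when the traces are decayed:

```python
import numpy as np

def sarsa_lambda_step(Q, z, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.99, lam=0.9):
    """gamma discounts future reward in the TD target; lam (trace decay)
    controls how far back along the trajectory credit is assigned."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error
    z[s, a] += 1.0            # accumulating trace for the visited pair
    Q += alpha * delta * z    # update every still-eligible pair
    z *= gamma * lam          # decay all traces by both factors
```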
0 votes · 1 answer
Episodic semi-gradient Q-learning for estimating an approximation of the optimal action-value function
On page 244 of the Sutton and Barto book on Reinforcement Learning (book), the pseudocode for episodic semi-gradient Sarsa is given, while no pseudocode is ever given for the corresponding episodic semi-gradient Q-learning.
I am aware of the…

hardhu · 133
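The book indeed never spells this variant out; a sketch of what the episodic semi-gradient Q-learning update could look like with linear function approximation (the `features` function and all names here are hypothetical), obtained by replacing the Sarsa target's $\hat q(S', A', \mathbf{w})$ with a max over actions:

```python
import numpy as np

def semi_gradient_q_update(w, features, s, a, r, s_next, actions, done,
                           alpha=0.01, gamma=0.99):
    """Semi-gradient: the target's dependence on w is ignored, so only the
    gradient of q_hat(s, a, w) = features(s, a) . w enters the update."""
    x = features(s, a)
    if done:
        target = r                      # no bootstrapping at terminal states
    else:
        target = r + gamma * max(w @ features(s_next, b) for b in actions)
    w += alpha * (target - w @ x) * x   # gradient of the linear q_hat is x
    return w
```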
0 votes · 2 answers
Categorizing the different algorithms in reinforcement learning
For some time I have been going through reinforcement learning, and have found a lot of diverse information, especially in the area of policies (algorithms).
I figured out that policies can be classified as on-policy vs. off-policy and model-based vs. model-free. Also, these are…

Sandeep Bhutani · 101