I would much appreciate it if you could point me in the right direction regarding these questions about targets for an approximate Q-function for SARSA, Expected SARSA and Q-learning (notation: S is the current state, A is the current action, R is the reward, S' is the next state and A' is the action chosen from that next state). I've written my thoughts next to each question/statement:
Does the Q-learning target computation require the probability of the current policy selecting A' (the action that was actually taken in the environment) in S'?

- I'm not sure what 'probability of the policy' means here. The target in Q-learning doesn't depend on the action actually taken (A'). The behaviour policy does have some randomness (epsilon, the probability of taking a random action), which determines which action is taken in the environment (a random action or the one that maximizes the Q-value). Does the question refer to that as the 'probability of the current policy'? Generally, though, I think the answer is no, since I don't see anything resembling a probability in the Q-function update (apart from that epsilon).
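To make what I mean by 'probability' concrete, here is how I currently picture the Q-learning and Expected SARSA targets (just a rough sketch with my own names q_values, gamma, epsilon, assuming a tabular Q stored as a 2-D NumPy array indexed by [state, action] and an epsilon-greedy policy):

```python
import numpy as np

def q_learning_target(q_values, reward, next_state, gamma):
    # Q-learning bootstraps from the greedy action in S';
    # no policy probabilities appear in the target.
    return reward + gamma * np.max(q_values[next_state])

def expected_sarsa_target(q_values, reward, next_state, gamma, epsilon):
    # Expected SARSA is the one target where I do see policy
    # probabilities: it averages Q(S', a) under the epsilon-greedy policy.
    n_actions = q_values.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_values[next_state])] += 1.0 - epsilon
    return reward + gamma * float(np.dot(probs, q_values[next_state]))
```

As far as I can tell, the only place policy probabilities show up is in the Expected SARSA target, not in the Q-learning one.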
Do we need an explicit policy for the Q-learning target to sample A' from? And for SARSA?

- I guess this is true for Q-learning, since we need the max Q-value, which determines which action A' we'll use in the update. For SARSA we update Q(S, A) based on the action that was actually taken (no need for a max).
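To show what I have in mind when I say 'sample A' from a policy', here is the SARSA target next to the epsilon-greedy behaviour policy I'm using (again just a sketch, same assumed q_values array and my own names):

```python
import numpy as np

def epsilon_greedy(q_values, state, epsilon, rng):
    # Behaviour policy: random action with probability epsilon, greedy otherwise.
    n_actions = q_values.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values[state]))

def sarsa_target(q_values, reward, next_state, next_action, gamma):
    # SARSA's target uses A' (next_action), i.e. an action that was
    # actually sampled from the policy in S'.
    return reward + gamma * q_values[next_state, next_action]
```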
Is this statement true: 'All methods (SARSA, Expected SARSA, Q-learning) require R and S' to perform updates'?

- All methods require S, A, R and S'. The statement only mentions a subset of the required quantities. Does that make it true or not, given that the full set of parameters is left out?
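Just to be explicit about the full set of quantities I mean, the generic update I have in mind looks like this (alpha being the step size and target being any of the targets sketched above), so S and A are needed on top of the R and S' that go into the target:

```python
def td_update(q_values, state, action, target, alpha):
    # All three methods update the same entry Q(S, A), so S and A are
    # needed on top of the R and S' that go into the target itself.
    q_values[state, action] += alpha * (target - q_values[state, action])
```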
Is the difference between SARSA and Q-learning targets only in how A' in S' is selected?

- I would say this is not correct, but I'm not entirely sure. Based on some code I've seen on GitHub, both of them select the next action in exactly the same way, but they differ in how they update the parameters (SARSA updates them based on the action actually taken in the environment, Q-learning based on the best possible action, regardless of which action was actually taken).