
For Q-learning with experience replay, why do we store observations in the memory bank as:

{ stateFrom, actionIx, immediateReward, resultingState }

instead of

{ stateFrom, actionIx, actionQValue, immediateReward, resultingState }

The first choice seems to be preferred in all the equations I see. [1] [2] But wouldn't it be useful to already have $actionQValue$ at hand during backprop, so we don't have to spend time computing it again?

Or is there something fundamental, like we can't rely on such an $actionQValue$ because it might become obsolete?
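
For concreteness, here is a minimal sketch (Python, using the field names above) of the two record layouts I mean:

```python
from collections import namedtuple

# The usual replay record: only raw environment feedback is stored.
Transition = namedtuple(
    "Transition",
    ["stateFrom", "actionIx", "immediateReward", "resultingState"],
)

# The variant I am asking about, which also caches the network's Q value
# for the chosen action at the time the transition was collected.
TransitionWithQ = namedtuple(
    "TransitionWithQ",
    ["stateFrom", "actionIx", "actionQValue", "immediateReward", "resultingState"],
)
```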

Kari

1 Answer


Or is there something fundamental, like we can't rely on such an $actionQValue$ because it might become obsolete?

Exactly that. The estimates of the Q values, and the preferred choice of next action, keep changing as learning progresses. Action-value estimates in a control problem are non-stationary because the agent keeps improving its policy.

Experience replay is about re-using observations from the environment - what reward R and resulting state S' occur when the agent takes action A in state S.

You could code an agent that also tries to re-use knowledge of its "previous self", such as what its action-value estimates were at the time. However, this would be counter-productive: the older estimates of Q are biased towards the initialisation values (typically all zero or random), and/or towards the action values of older policies that the agent has since improved upon.
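
To make that concrete, here is a minimal sketch of a DQN-style replay update (Python/PyTorch; names such as `q_net`, `replay` and the tiny network are placeholders, not from the question). Every Q value that enters the loss is recomputed from the current network weights at training time, which is exactly why caching $actionQValue$ in the buffer would buy nothing:

```python
import random
from collections import deque, namedtuple

import torch
import torch.nn as nn

# Sketch only: network size, buffer capacity and hyperparameters are placeholders.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

replay = deque(maxlen=100_000)   # holds only raw (S, A, R, S') facts plus a terminal flag
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(batch_size=32):
    batch = random.sample(list(replay), batch_size)
    states = torch.stack([t.state for t in batch])
    actions = torch.tensor([t.action for t in batch])
    rewards = torch.tensor([t.reward for t in batch], dtype=torch.float32)
    next_states = torch.stack([t.next_state for t in batch])
    dones = torch.tensor([t.done for t in batch], dtype=torch.float32)

    # Q(S, A) is recomputed with the *current* weights every time the transition
    # is replayed. A Q value cached at collection time would reflect an older,
    # worse policy, so it is never stored.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # The bootstrap target max_a' Q(S', a') is likewise recomputed fresh
        # (a separate target network is often used here, omitted for brevity).
        target = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values

    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```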

Neil Slater