
For Q-learning with experience replay, why do we store observations in the memory bank as:

{ stateFrom, actionIx, immediateReward, resultingState }

instead of

{ stateFrom, actionIx, actionQValue, immediateReward, resultingState }

The first choice seems to be preferred in all the equations I see. [1] [2] But wouldn't it be useful to already have $actionQValue$ at hand during backprop, so we don't have to spend time computing it again?

Or is there something fundamental, like we can't rely on such an $actionQValue$ because it might become obsolete?
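
For concreteness, here is a minimal sketch (Python, using the field names above) of the two record layouts I mean:

```python
from collections import namedtuple

# The usual replay record: only raw environment feedback is stored.
Transition = namedtuple(
    "Transition",
    ["stateFrom", "actionIx", "immediateReward", "resultingState"],
)

# The variant I am asking about, which also caches the network's Q value
# for the chosen action at the time the transition was collected.
TransitionWithQ = namedtuple(
    "TransitionWithQ",
    ["stateFrom", "actionIx", "actionQValue", "immediateReward", "resultingState"],
)
```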

Kari

1 Answer


Or is there something fundamental, like we can't rely on such an $actionQValue$ because it might become obsolete?

Exactly that. The estimates of the Q values, and the preferred choice of next action, keep changing as learning progresses. Action-value estimates in a control problem are non-stationary because the agent keeps improving its policy.

Experience replay is about re-using observations from the environment - what reward R and resulting state S' occur when the agent takes action A in state S.

You could code an agent that also tries to re-use knowledge of its "previous self", such as what its action-value estimates were at the time. However, this would be counter-productive: the older estimates of Q are biased towards the initialisation values (typically all zero or random), and/or towards the action values of older policies that the agent has since improved upon.
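
To make that concrete, here is a minimal sketch of a DQN-style replay update (Python/PyTorch; names such as `q_net`, `replay` and the tiny network are placeholders, not from the question). Every Q value that enters the loss is recomputed from the current network weights at training time, which is exactly why caching $actionQValue$ in the buffer would buy nothing:

```python
import random
from collections import deque, namedtuple

import torch
import torch.nn as nn

# Sketch only: network size, buffer capacity and hyperparameters are placeholders.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

replay = deque(maxlen=100_000)   # holds only raw (S, A, R, S') facts plus a terminal flag
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(batch_size=32):
    batch = random.sample(list(replay), batch_size)
    states = torch.stack([t.state for t in batch])
    actions = torch.tensor([t.action for t in batch])
    rewards = torch.tensor([t.reward for t in batch], dtype=torch.float32)
    next_states = torch.stack([t.next_state for t in batch])
    dones = torch.tensor([t.done for t in batch], dtype=torch.float32)

    # Q(S, A) is recomputed with the *current* weights every time the transition
    # is replayed. A Q value cached at collection time would reflect an older,
    # worse policy, so it is never stored.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # The bootstrap target max_a' Q(S', a') is likewise recomputed fresh
        # (a separate target network is often used here, omitted for brevity).
        target = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values

    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```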

Neil Slater