For Q-learning with experience replay, why do we store observations in the replay bank as
{ stateFrom, actionIx, immediateReward, resultingState }
instead of
{ stateFrom, actionIx, actionQValue, immediateReward, resultingState }?
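To make the two layouts concrete, here is roughly what I mean (a minimal sketch; the field names are my own, mirroring the notation above):

```python
# Sketch of the two replay-bank entry layouts being compared.
from typing import NamedTuple, Tuple

class TransitionA(NamedTuple):          # what the literature stores
    state_from: Tuple[float, ...]
    action_ix: int
    immediate_reward: float
    resulting_state: Tuple[float, ...]

class TransitionB(NamedTuple):          # what I'm asking about
    state_from: Tuple[float, ...]
    action_ix: int
    action_q_value: float               # Q(stateFrom, actionIx) at collection time
    immediate_reward: float
    resulting_state: Tuple[float, ...]
```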
The first layout seems to be the preferred one in all the equations I've seen. [1] [2] But wouldn't it be useful to already have $actionQValue$ at hand during backprop, so we don't have to spend time computing it again?
Or is there something fundamental here, e.g. that we can't rely on such a stored $actionQValue$ because it might become obsolete once the network's weights change?
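For reference, this is roughly the replay update step I'm picturing, where $Q(s, a)$ gets recomputed from the current network anyway. A minimal PyTorch-style sketch, with `q_network`, `target_network`, and `optimizer` as hypothetical placeholders of my own (terminal-state handling omitted):

```python
import torch
import torch.nn.functional as F

def replay_update(batch, q_network, target_network, optimizer, gamma=0.99):
    states, action_ixs, rewards, next_states = batch  # sampled from the bank

    # Q(stateFrom, actionIx) is recomputed here with the *current* weights --
    # this is the value I was tempted to store in the bank instead.
    q_all = q_network(states)                                    # [batch, n_actions]
    q_taken = q_all.gather(1, action_ixs.unsqueeze(1)).squeeze(1)

    # Bootstrapped target, also built from up-to-date weights.
    with torch.no_grad():
        targets = rewards + gamma * target_network(next_states).max(dim=1).values

    loss = F.mse_loss(q_taken, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```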