I am trying out reinforcement learning using Q-learning. The data come from some made-up equations, so I can generate an unlimited amount of data.
One thing that troubles me: after I learn the Q-function, I use $\arg\max_a Q(s, a)$ to pick the action for state $s$, but the values of $Q(s, a)$ are very close to each other across actions at the same state. For example, with 3 possible actions, the Q-function might take the values 99999, 100000, and 100001 for those actions.

I'm fairly confident in the result itself, because I can always generate more data and the Q-function converges quickly. But is there any way to remove this large shared constant during the learning process rather than at the end? Is this commonly observed, and how should I deal with it?
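For reference, here is a minimal sketch of the kind of setup I mean (tabular Q-learning; `step()`, the state/action counts, and the constants are placeholders standing in for my made-up equations, not my actual code):

```python
import numpy as np

N_STATES, N_ACTIONS = 10, 3
ALPHA, GAMMA, EPS = 0.1, 0.99, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))

def step(s, a):
    """Placeholder for the synthetic dynamics: returns (next_state, reward)."""
    s_next = np.random.randint(N_STATES)
    reward = 1.0  # always-positive reward
    return s_next, reward

s = 0
for _ in range(100_000):
    # epsilon-greedy action selection
    if np.random.rand() < EPS:
        a = np.random.randint(N_ACTIONS)
    else:
        a = int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # standard Q-learning update
    Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])
    s = s_next

# With an always-positive reward, every Q(s, a) is dominated by the same
# large constant (roughly r / (1 - GAMMA)), so the per-action differences
# are tiny compared to the values themselves -- the effect I describe above.
print(Q[0])
```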