I am trying out reinforcement learning using Q-learning. The data come from some made-up equations, so I can generate an unlimited amount of data.
One thing that troubles me: after I learn the Q-function, I use $\arg\max_a Q(s, a)$ to pick the action for state $s$, but the values of $Q(s, a)$ are very close to each other across actions at the same state. For example, with 3 possible actions, the Q-function might take the values 99999, 100000, and 100001 for those actions.

I'm fairly confident in the result itself, because I can always generate more data and the Q-function converges quickly. But is there any way to remove this large shared constant during the learning process rather than at the end? Is this commonly observed, and how should I deal with it?
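For reference, here is a minimal sketch of the kind of setup I mean (tabular Q-learning; `step()`, the state/action counts, and the constants are placeholders standing in for my made-up equations, not my actual code):

```python
import numpy as np

N_STATES, N_ACTIONS = 10, 3
ALPHA, GAMMA, EPS = 0.1, 0.99, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))

def step(s, a):
    """Placeholder for the synthetic dynamics: returns (next_state, reward)."""
    s_next = np.random.randint(N_STATES)
    reward = 1.0  # always-positive reward
    return s_next, reward

s = 0
for _ in range(100_000):
    # epsilon-greedy action selection
    if np.random.rand() < EPS:
        a = np.random.randint(N_ACTIONS)
    else:
        a = int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # standard Q-learning update
    Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])
    s = s_next

# With an always-positive reward, every Q(s, a) is dominated by the same
# large constant (roughly r / (1 - GAMMA)), so the per-action differences
# are tiny compared to the values themselves -- the effect I describe above.
print(Q[0])
```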