In reinforcement learning, our goal is to optimize the state-value function or the action-value function, which are defined as follows:
$V^{\pi}(s) = \sum_{s'} p(s'|s,\pi(s))\,[r(s'|s,\pi(s)) + \gamma V^{\pi}(s')] = \mathbb{E}_{\pi}[r(s'|s,a) + \gamma V^{\pi}(s') \mid s_0 = s]$

$Q^{\pi}(s,a) = \sum_{s'} p(s'|s,a)\,[r(s'|s,a) + \gamma V^{\pi}(s')] = \mathbb{E}_{\pi}[r(s'|s,a) + \gamma V^{\pi}(s') \mid s_0 = s, a_0 = a]$
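For concreteness, here is a minimal sketch (in Python) of how I picture that Bellman backup, using a made-up two-state MDP; the states, rewards, and probabilities are purely illustrative, but the point is that the backup explicitly sums over $p(s'|s,a)$:

```python
import numpy as np

gamma = 0.9

# Hypothetical dynamics: p[s][a] is a list of (next_state, probability, reward) triples.
p = {
    0: {0: [(0, 0.7, 1.0), (1, 0.3, 0.0)],
        1: [(1, 1.0, 2.0)]},
    1: {0: [(0, 1.0, 0.0)],
        1: [(1, 1.0, 1.0)]},
}

V = np.zeros(2)  # current estimate of V^pi(s') under some fixed policy pi

def q_backup(s, a):
    """Bellman expectation backup: sum over s' of p(s'|s,a) * [r + gamma * V(s')]."""
    return sum(prob * (r + gamma * V[s_next]) for s_next, prob, r in p[s][a])

print(q_backup(0, 1))  # Q^pi(0, 1) given the current V
```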
However, when we use the Q-learning method to obtain the optimal policy, the update rule is the following:
$Q(S,A) \leftarrow Q(S,A) + \alpha\,[R + \gamma \max_{a} Q(S',a) - Q(S,A)]$
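And here is a minimal sketch of that tabular Q-learning update, assuming a Gymnasium-style environment `env` with discrete states and actions (the environment and hyperparameters are just placeholders); notice that only the sampled transition $(S, A, R, S')$ is used and $p(s'|s,a)$ never appears:

```python
import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```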
My question is:
Why is there no transition probability $p(s'|s,a)$ in Q-learning? Does this mean we don't need $p$ when modeling an MDP?