In reinforcement learning, our goal is to optimize the state-value function or the action-value function, which are defined as follows:
$V^{\pi}(s) = \sum_{s'} p(s'|s,\pi(s))\,[r(s'|s,\pi(s)) + \gamma V^{\pi}(s')] = \mathbb{E}_{\pi}[r(s'|s,a) + \gamma V^{\pi}(s') \mid s_0 = s]$

$Q^{\pi}(s,a) = \sum_{s'} p(s'|s,a)\,[r(s'|s,a) + \gamma V^{\pi}(s')] = \mathbb{E}_{\pi}[r(s'|s,a) + \gamma V^{\pi}(s') \mid s_0 = s, a_0 = a]$
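For concreteness, here is a minimal sketch (in Python) of how I picture that Bellman backup, using a made-up two-state MDP; the states, rewards, and probabilities are purely illustrative, but the point is that the backup explicitly sums over $p(s'|s,a)$:

```python
import numpy as np

gamma = 0.9

# Hypothetical dynamics: p[s][a] is a list of (next_state, probability, reward) triples.
p = {
    0: {0: [(0, 0.7, 1.0), (1, 0.3, 0.0)],
        1: [(1, 1.0, 2.0)]},
    1: {0: [(0, 1.0, 0.0)],
        1: [(1, 1.0, 1.0)]},
}

V = np.zeros(2)  # current estimate of V^pi(s') under some fixed policy pi

def q_backup(s, a):
    """Bellman expectation backup: sum over s' of p(s'|s,a) * [r + gamma * V(s')]."""
    return sum(prob * (r + gamma * V[s_next]) for s_next, prob, r in p[s][a])

print(q_backup(0, 1))  # Q^pi(0, 1) given the current V
```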
However, when we use the Q-learning method to obtain the optimal policy, the update rule is the following:
$Q(S,A) \leftarrow Q(S,A) + \alpha\,[R + \gamma \max_{a} Q(S',a) - Q(S,A)]$
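And here is a minimal sketch of that tabular Q-learning update, assuming a Gymnasium-style environment `env` with discrete states and actions (the environment and hyperparameters are just placeholders); notice that only the sampled transition $(S, A, R, S')$ is used and $p(s'|s,a)$ never appears:

```python
import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```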
My question is:
Why is there no transition probability $p(s'|s,a)$ in Q-learning? Does this mean we don't need $p$ when modeling an MDP?