140

An artificial intelligence website defines off-policy and on-policy learning as follows:

"An off-policy learner learns the value of the optimal policy independently of the agent's actions. Q-learning is an off-policy learner. An on-policy learner learns the value of the policy being carried out by the agent including the exploration steps."

I would like to ask for clarification on this, because the two definitions don't seem to differ at all to me; they read as identical. What I do understand is model-free versus model-based learning, and I don't know whether that distinction has anything to do with the one in question.

How is it possible that the optimal policy is learned independently of the agent's actions? Isn't the policy learned when the agent performs the actions?

cgo
  • 7,445
  • 10
  • 42
  • 61
  • 4
I added a comment to http://stackoverflow.com/questions/6848828/reinforcement-learning-differences-between-qlearning-and-sarsatd/41420616#41420616, the **TL;DR** part might be helpful with the understanding, too. – zyxue Jan 02 '17 at 04:27
  • 1
    here is a good explanation https://nb4799.neu.edu/wordpress/?p=1850 – Ivan Kush Jun 20 '17 at 18:50
  • 1
    I would also like to add that there is an off-policy variant of SARSA. This paper (http://www.cs.ox.ac.uk/people/shimon.whiteson/pubs/vanseijenadprl09.pdf) will review on and off policy in the introduction, and then explain expected sarsa. Also lookup expected policy gradients (EPG) to find a more general theory that meshes the two types. – Josh Albert Jun 18 '18 at 11:27
  • 1
    I found this blog really helpful: https://leimao.github.io/blog/RL-On-Policy-VS-Off-Policy/ – raksheetbhat Nov 23 '19 at 04:17
  • Maybe this could be useful: [On-Policy v/s Off-Policy Learning](https://towardsdatascience.com/on-policy-v-s-off-policy-learning-75089916bc2f) – Francesco Lucianò Feb 07 '21 at 11:59

7 Answers

174

First of all, there's no reason that an agent has to do the greedy action; agents can explore or they can follow options. This is not what separates on-policy from off-policy learning.

The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state $s'$ and the greedy action $a'$. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy.

The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state $s'$ and the current policy's action $a''$. It estimates the return for state-action pairs assuming the current policy continues to be followed.

The distinction disappears if the current policy is a greedy policy. However, such an agent would not be good since it never explores.
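
To make the contrast concrete, here is a minimal tabular sketch in Python (not from the original answer; the step size, discount, and the `epsilon_greedy` helper are illustrative assumptions). Both algorithms may behave ε-greedily; they differ only in which action is used to form the bootstrap target:

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # Behaviour policy used by BOTH algorithms: explore with probability eps.
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.99):
    # On-policy target: uses the action a_next the policy actually chose,
    # including exploratory actions.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.99):
    # Off-policy target: uses the greedy action in s', regardless of what
    # the behaviour policy will actually do next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```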

Have you looked at the book available for free online? Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. Second edition, MIT Press, Cambridge, MA, 2018.

Tannex
  • 5
  • 2
Neil G
  • 13,633
  • 3
  • 41
  • 84
  • 15
nice explanation! Your example on Q-learning is better formulated than in Sutton's book, which says: "_the learned action-value function, Q, directly approximates Q* , the optimal action-value function, independent of the policy being followed. This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs. The policy still has an effect in that it determines which state-action pairs are visited and updated._" – Ciprian Tomoiagă Jan 17 '17 at 13:50
  • 18
    In general, I don't find Sutton and Barto very readable at all. I find the explanations they offer are not very comprehensible. I am not sure why their book gets recommended all over the place – N.S. Jun 07 '18 at 06:22
  • @S.N. For many students of reinforcement learning, Sutton and Barto is the first book they read. – Neil G Jun 07 '18 at 06:25
  • @NeilG What is the second book to read? Since Sutton&Barto is so new and covers things like AlphaGo, and most of current RL is even beyond that. Any tips are welcome. – Jakub Arnold Feb 06 '19 at 21:37
  • 5
    @JakubArnold the original Sutton & Barto book is from 1998 and it does not cover deep reinforcement learning. The 2nd edition only mentions things like AlphaGo, but the focus of the book is in more classical approaches. If you want more RL resources, take a look at [this list](https://medium.com/@yuxili/resources-for-deep-reinforcement-learning-a5fdf2dc730f). I suggest David Silver's videos and Puterman's book as they are more approachable. For more theoretical material, I recommend Bertsekas' books. Take a look at the Spinning Up website for DRL algorithms and links to original papers. – Douglas De Rizzo Meneghetti Feb 14 '19 at 11:03
  • Great explanation. But I'm still confused by why Sarsa is on-policy and Q-learning is off-policy. If you look at Sutton & Barto's book, the ONLY difference between Sarsa (for estimating optimal policy q*, S&B book p.130) and Q-learning is that when you update the Q, whether you choose A' epsilon-greedy or not. If it's purely greedy, then it's Q-learning, if not, it's sarsa. So in this case, it does depend on exploration or not. By the way, Q-learning is also known as Sarsa-max. – Albert Chen Jul 10 '19 at 13:50
  • 2
    @AlbertChen "So in this case, it does depend on exploration or not": No, because both algorithms explore. The difference is how Q is updated. – Neil G Jul 15 '19 at 20:57
  • So it's all about whether evaluation policy is same as behaviour policy? – dzieciou Jul 03 '20 at 13:36
  • @dzieciou Depending on your defintiions, that's right. – Neil G Jul 03 '20 at 13:42
  • @NeilG: You wrote about Q-Learning "In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy." So whenever there is no greedy policy it is called Off-policy-learning? – PeterBe Aug 19 '21 at 11:52
  • @PeterBe No. See my conversation in these comments. – Neil G Aug 20 '21 at 00:08
  • @NeilG: You wrote in your answer that Q-learning is just assuming a greedy policy and not following it. First of all using a greedy policy can lead to suboptimal solutions. This would mean that Q-learning will eventually lead to suboptimal decisions. Further, which policy is Q-learning really using if it is not using a greedy-policy? From an optimization perspective you would need to use dynamic programming to get an optimal result. So is Q-learning using dynamic programming? – PeterBe Aug 20 '21 at 06:37
  • @PeterBe "Which policy is Q-learning really using"—doesn't matter. "So is Q-learning using dynamic programming?"—Yes. – Neil G Aug 20 '21 at 07:31
  • @NeilG: Thanks for your answers. Why does it not matter which policy Q-learning is using? To calculate the Q-value you have to use a policy because the Q-value gives you the discounted future reward, as far as I understood. Of course this reward depends on the policy. Further, if Q-learning uses Dynamic programming what is the difference between Q-learning and dynamic programming? Or the other way round: If you can optimally solve a optimization problem with dynamic programming, why would you use Q-learning instead? – PeterBe Aug 20 '21 at 08:02
  • 1
    @peterbe Please just ask other questions. There are many errors in your reasoning. – Neil G Aug 20 '21 at 08:25
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/128763/discussion-between-peterbe-and-neil-g). – PeterBe Aug 20 '21 at 09:24
  • @NeilG: Where and how does Q-learning use dynamic programming as you pointed out? Dynamic programming solves the optmization problem without any Q-learning in the field of operations research. So I do not understand how they are combined – PeterBe Aug 27 '21 at 12:43
48

First of all, what does a policy (denoted by $\pi$) actually mean?
A policy specifies the action $a$ that is taken in a state $s$ (or, more precisely, $\pi(a|s)$ is the probability that action $a$ is taken in state $s$).

Second, what types of learning do we have?
1. Evaluate the $Q(s,a)$ function: predict the sum of future discounted rewards, where $a$ is an action and $s$ is a state.
2. Find a policy $\pi$ (actually, $\pi(a|s)$) that yields the maximum reward.

Back to the original question. On-policy and off-policy learning is only related to the first task: evaluating $Q(s,a)$.

The difference is this:
In on-policy learning, the $Q(s,a)$ function is learned from actions that we took using our current policy $\pi(a|s)$.
In off-policy learning, the $Q(s,a)$ function is learned from taking different actions (for example, random actions). We don't even need a policy at all!

This is the update function for the on-policy SARSA algorithm: $Q(s,a) \leftarrow Q(s,a)+\alpha(r+\gamma Q(s',a')-Q(s,a))$, where $a'$ is the action that was actually taken according to policy $\pi$.

Compare it with the update function for the off-policy Q-learning algorithm: $Q(s,a) \leftarrow Q(s,a)+\alpha(r+\gamma \max_{a'}Q(s',a')-Q(s,a))$, where the maximum is taken over all actions $a'$ available in state $s'$.
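
As a concrete (made-up) numerical illustration of how the two targets differ: suppose $\gamma=0.9$, $r=1$, $Q(s',\cdot)=[2,\,5]$, and the ε-greedy behaviour policy happens to take the exploratory action $a'=0$ in $s'$. SARSA's target is $1+0.9\cdot Q(s',0)=1+0.9\cdot 2=2.8$, whereas Q-learning's target is $1+0.9\cdot\max_{a'}Q(s',a')=1+0.9\cdot 5=5.5$. The two updates coincide only when the behaviour policy happens to act greedily.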

Rasoul
  • 243
  • 1
  • 3
  • 9
Dmitry Mottl
  • 581
  • 4
  • 3
  • 1
    *"In off-policy learning, the $Q(s,a)$ function is learned from taking different actions (for example, random actions). We even don't need a policy at all!"* - How can you not have a policy? Isn't even taking random actions technically a policy? Also it would be helpful if you could illucidate the difference in the Q update between SARSA and Q-Learning that show what makes either on policy or off policy. – alex Jun 08 '20 at 03:43
  • 2
    @alex If I understand correctly, a policy is a function of current state and environment, while taking random actions would not take current state/environment into account. I guess you could have a function that just outputs random actions no matter the input, but then whether that's an actual "policy" is debatable. – chimbo Mar 16 '21 at 21:49
  • So can we say that algorithms like policy gradient are a mix of on and off-policy learning? Because there is this exploration-exploitation rate that tells the RL when to be greedy and when to explore – Sarvagya Gupta Nov 11 '21 at 18:26
  • 1
    I don't really understand this distinction. Isn't $a' = \pi(a'|s')$ just equal to $a' = max_{a'} Q(s', a')$ in the off policy case? That's still a policy, just not one you need to store separately from the Q function. Does the distinction lie in the way we often keep a separate target Q in some flavors of RL that we only update with information from the working Q every $k$ iterations? – Pavel Komarov Nov 15 '21 at 17:12
  • @PavelKomarov Let's assume we use $\varepsilon$-greedy policy. Then in the on-policy formula $a'$ in $Q(s', a')$ stands for the $\varepsilon$-greedy action which is different from $\mathrm{argmax}_{a'} Q(s', a')$. – Appliqué Mar 04 '22 at 19:09
27

On-policy methods estimate the value of a policy while using it for control.

In off-policy methods, the policy used to generate behaviour, called the behaviour policy, may be unrelated to the policy that is evaluated and improved, called the estimation policy.

An advantage of this separation is that the estimation policy may be deterministic (e.g. greedy), while the behaviour policy can continue to sample all possible actions.
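
A minimal sketch of this separation in Python (the table sizes and function names are hypothetical): both policies read the same value estimates, but only the behaviour policy keeps exploring.

```python
import numpy as np

n_states, n_actions = 4, 2               # hypothetical problem size
Q = np.zeros((n_states, n_actions))      # shared value estimates

def behaviour_policy(s, eps=0.1):
    # Generates behaviour: epsilon-greedy, so it keeps sampling all actions.
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def estimation_policy(s):
    # The policy being evaluated and improved: deterministic (greedy).
    return int(np.argmax(Q[s]))
```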

For further details, see sections 5.4 and 5.6 of the book Reinforcement Learning: An Introduction by Barto and Sutton, first edition.

10

The difference between off-policy and on-policy methods is that with the former you do not need to follow any specific policy: your agent could even behave randomly, and off-policy methods can still find the optimal policy. On-policy methods, on the other hand, depend on the policy being used. In the case of Q-learning, which is off-policy, it will find the optimal policy independently of the policy used during exploration; however, this is true only when you visit the different states enough times. You can find in the original paper by Watkins the actual proof that shows this very nice property of Q-learning. There is, however, a trade-off: off-policy methods tend to be slower than on-policy methods. Here is a link with another interesting summary of the properties of both types of methods.
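
As a rough illustration of that property (a sketch, not Watkins' construction): tabular Q-learning on a tiny made-up chain MDP, driven by a purely random behaviour policy, still ends up greedy with respect to the optimal actions, provided every state-action pair is visited often enough.

```python
import numpy as np

# Made-up 4-state chain: action 1 moves right, action 0 moves left;
# reaching the last state yields reward 1 and ends the episode.
n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.1
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for _ in range(2000):
    s, done = 0, False
    while not done:
        a = np.random.randint(n_actions)        # behaviour policy: uniformly random
        s_next, r, done = step(s, a)
        # Off-policy target: bootstrap from the greedy value of s_next.
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # states 0-2 should prefer action 1 ("move right")
```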

Juli
  • 240
  • 2
  • 7
  • 2
    Off-policy methods are not only slower, but can be unstable when combined with bootstrapping (i.e. how Q-learning builds estimates from each other) and function approximators (e.g. neural networks). – Neil Slater Sep 01 '17 at 16:50
3

From the Sutton book: "The on-policy approach in the preceding section is actually a compromise—it learns action values not for the optimal policy, but for a near-optimal policy that still explores. A more straightforward approach is to use two policies, one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behavior. The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning."

Identicon
  • 31
  • 1
2

This is the recursive version of the Q-function (according to the Bellman equation):

$$Q_\pi(s_t,a_t)=\mathbb{E}_{\,r_t,\,s_{t+1}\,\sim\,E}\left[r(s_t,a_t)+\gamma\,\mathbb{E}_{\,a_{t+1}\,\sim\,\pi}\left[Q_\pi(s_{t+1}, a_{t+1})\right]\right]$$

Notice that the outer expectation exists because the current reward and the next state are sampled ($\sim$) from the environment ($E$). The inner expectation exists because the Q-value for the next state depends on the next action. If your policy is deterministic, there is no inner expectation: $a_{t+1}$ is a known value that depends only on the next state, so let's call it $A(s_{t+1})$:

$$Q_{det}(s_t,a_t)=\mathbb{E}_{\,r_t,\,s_{t+1}\,\sim\,E}\left[r(s_t,a_t)+\gamma\,Q_{det}(s_{t+1}, A(s_{t+1}))\right]$$

This means the Q-value depends only on the environment for deterministic policies.

The optimal policy is always deterministic (it always takes the action that leads to the highest expected return), and Q-learning directly approximates the optimal policy. Therefore the Q-values of this greedy agent depend only on the environment.

Well, if the Q-values depend only on the environment, it doesn't matter how I explore the environment; that is, I can use an exploratory behaviour policy.
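
For completeness, substituting the greedy choice $A(s_{t+1})=\arg\max_{a}Q_{det}(s_{t+1},a)$ into the equation above recovers the Bellman optimality form that the Q-learning target approximates:

$$Q_{det}(s_t,a_t)=\mathbb{E}_{\,r_t,\,s_{t+1}\,\sim\,E}\left[r(s_t,a_t)+\gamma\,\max_{a}Q_{det}(s_{t+1},a)\right]$$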

João Pedro
  • 151
  • 3
2

On-policy learning: The same (ϵ-greedy) policy that is evaluated and improved is also used to select actions. For example, the SARSA TD learning algorithm.

Off-policy learning: The (greedy) policy that is evaluated and improved is different from the (ϵ-greedy) policy that is used to select actions. For example, the Q-learning algorithm.

Sushil Thapa
  • 116
  • 5