You can have an on-policy RL algorithm that is value-based. An example of such an algorithm is SARSA, so not all value-based algorithms are off-policy. A value-based algorithm is just an algorithm that derives its policy by first estimating the associated value function.
To understand the difference between on-policy and off-policy, you need to understand that there are two phases of an RL algorithm: the learning (or training) phase and the inference (or behaviour) phase (after the training phase). The distinction between on-policy and off-policy algorithms only concerns the training phase.
During the learning phase, the RL agent needs to learn an estimate of the optimal value (or policy) function. Given that the agent does not yet know the optimal policy, it often behaves sub-optimally. During training, it therefore faces the exploration-exploitation dilemma. In the context of RL, exploration is the selection and execution (in the environment) of an action that is likely not optimal (according to the knowledge of the agent), while exploitation is the selection and execution of an action that is optimal according to the agent's current best estimate of the optimal policy. During the training phase, the agent needs to do both: exploration is required to discover more about the optimal strategy, but exploitation is also required to refine its knowledge of the already visited and partially known states of the environment. The agent thus cannot just exploit the already visited states; it also needs to explore possibly unvisited states, which often requires performing a sub-optimal action.
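As a minimal sketch of this trade-off (assuming a hypothetical tabular action-value estimate `Q` of shape `n_states x n_actions`), $\epsilon$-greedy action selection could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, state, epsilon):
    """Trade off exploration and exploitation given a tabular
    action-value estimate Q (an n_states x n_actions array)."""
    if rng.random() < epsilon:
        # Exploration: a random (likely sub-optimal) action.
        return int(rng.integers(Q.shape[1]))
    # Exploitation: the greedy action according to the current estimate.
    return int(np.argmax(Q[state]))
```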
An off-policy algorithm is an algorithm that, during training, uses a behaviour policy (that is, the policy it uses to select actions) that is different from the target policy it tries to estimate (here, the optimal policy). For example, $Q$-learning often uses an $\epsilon$-greedy policy (with probability $\epsilon$ it chooses a random or explorative action and with probability $1-\epsilon$ it chooses the action that is optimal according to its current best estimate of the optimal policy) to behave (that is, to explore and exploit the environment), while, in its update rule, because of the $\max$ operator, it assumes that the greedy action (that is, the current optimal action in a given state) is chosen.
An on-policy algorithm is an algorithm that, during training, chooses actions using a policy derived from the current estimate of the optimal policy, and whose updates are based on the actions actually selected by that same (behaviour) policy; in other words, the behaviour policy and the target policy coincide. For example, SARSA is an on-policy algorithm because its update rule uses the value of the next action actually selected by the behaviour policy, rather than the $\max$ over actions.
The difference between the update rules of $Q$-learning (off-policy) and SARSA (on-policy) is thus that $Q$-learning uses the $\max$ operator (i.e. the value of the greedy action), while SARSA uses the value of the action actually taken by the behaviour policy.
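For reference, the standard tabular update rules (with step size $\alpha$ and discount factor $\gamma$) make this explicit:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad \text{($Q$-learning)}$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \quad \text{(SARSA)}$$

where $a_{t+1}$ is the action actually selected by the behaviour policy (e.g. $\epsilon$-greedy) in state $s_{t+1}$.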
In the case of policy-based or policy search algorithms (e.g. REINFORCE), the distinction between on-policy and off-policy is often not made because, in this context, there isn't usually a clear separation between a behaviour policy (the policy used to behave during training) and a target policy (the policy to be estimated).
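For instance, one common form of the basic REINFORCE update is

$$\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \ln \pi_\theta(a_t \mid s_t),$$

where the return $G_t$ is computed from a trajectory sampled by following $\pi_\theta$ itself, so the same parameterised policy both generates the behaviour and is the object being updated.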
You can think of actor-critic algorithms as both value-based and policy-based, because they use both a value function (the critic) and a policy function (the actor).
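A minimal sketch of this idea, assuming a tabular one-step actor-critic with hypothetical numbers of states and actions (the environment interaction loop is omitted):

```python
import numpy as np

n_states, n_actions = 5, 2
alpha_actor, alpha_critic, gamma = 0.1, 0.1, 0.99

H = np.zeros((n_states, n_actions))  # actor parameters (action preferences)
V = np.zeros(n_states)               # critic's estimate of the state-value function

def policy(s):
    """Softmax policy derived from the actor's preferences for state s."""
    prefs = H[s] - H[s].max()         # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def actor_critic_step(s, a, r, s_next, done):
    """One-step actor-critic update for the transition (s, a, r, s_next)."""
    # Critic: TD error with respect to the current value estimates.
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha_critic * td_error
    # Actor: policy-gradient-style update, using the TD error as a
    # (biased) estimate of the advantage of action a in state s.
    probs = policy(s)
    grad_log = -probs
    grad_log[a] += 1.0                # gradient of log-softmax w.r.t. H[s]
    H[s] += alpha_actor * td_error * grad_log
```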
The usual examples of model-based algorithms are value and policy iteration, which are algorithms that use the transition and reward functions (of the given Markov decision process) to estimate the value function. However, on-policy, off-policy, value-based or policy-based algorithms can also be model-based, in the sense that they might use a model of the environment in some way.
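For example, the value iteration update explicitly uses the model, i.e. the dynamics $p(s', r \mid s, a)$ of the Markov decision process:

$$V_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_k(s') \right], \quad \text{for all states } s.$$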