
In reinforcement learning, is a policy always deterministic, or is it a probability distribution over actions (from which we sample)? If the policy is deterministic, why isn't the value function, which is defined at a given state for a given policy $\pi$ as

$$V^{\pi}(s) = E\left[\sum_{t>0} \gamma^{t}r_t|s_0 = s, \pi\right]$$

a point output?

In the above definition, we take an expectation. What is this expectation over?

Can a policy lead to different routes?

MiloMinderbinder

2 Answers


There are multiple questions here: 1. Is a policy always deterministic? 2. If the policy is deterministic, shouldn't the value also be deterministic? 3. What is the expectation over in the value-function estimate? Your last question ("Can a policy lead to routes that have different current values?") is not very clear, but I think you mean: 4. Can a policy lead to different routes?

  1. A policy is a function that can be either deterministic or stochastic. It dictates what action to take given a particular state. The distribution $\pi(a\mid s)$ is used for a stochastic policy, and a mapping function $\pi:S \rightarrow A$ is used for a deterministic policy, where $S$ is the set of possible states and $A$ is the set of possible actions.

  2. What the value function averages over is not deterministic. The value of a state is the expected return if you start at that state and continue to follow the policy; even if the policy is deterministic, the reward function and the environment might not be, so the return of any single run is random and the value reports its expectation as a single number.

  3. The expectation in that formula is over all the possible routes (trajectories) starting from state $s$. Usually, the routes or paths are decomposed into multiple steps, which are used to train value estimators. These steps can be represented by the tuple $(s,a,r,s')$ (state, action, reward, next state). A short Monte Carlo sketch after this list makes the averaging concrete.

  4. This is related to answer 2: the policy can lead to different paths (even a deterministic policy), because the environment is usually not deterministic.
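To make points 1-3 concrete, here is a minimal sketch (the toy chain environment, its probabilities, and the policies below are made up for illustration): a deterministic policy is just a mapping from state to action, a stochastic policy is a distribution $\pi(a\mid s)$ that we sample from, and $V^{\pi}(s)$ can be approximated by averaging the discounted return over many sampled trajectories; that average is exactly the expectation in the formula above.

```python
import random

# Toy 1-D chain: states 0..4, state 4 is terminal and pays reward 1 on entry.
# Transitions are stochastic: the intended move succeeds with prob 0.8,
# otherwise the agent slips and stays put (this is the environment noise).
TERMINAL, GAMMA = 4, 0.9

def step(s, a):
    """a = +1 (right) or -1 (left); returns (next_state, reward)."""
    if random.random() < 0.8:                    # intended move
        s_next = min(max(s + a, 0), TERMINAL)
    else:                                        # slip: stay where you are
        s_next = s
    return s_next, (1.0 if s_next == TERMINAL else 0.0)

# 1. Deterministic policy: a plain mapping S -> A.
def pi_det(s):
    return +1                                    # always move right

# 1. Stochastic policy: a distribution pi(a|s) that we sample from.
def pi_stoch(s):
    return +1 if random.random() < 0.7 else -1

# 3. Monte Carlo estimate of V^pi(s0): average the discounted return over
#    many sampled trajectories starting from s0. Averaging over trajectories
#    is the expectation in the value-function definition.
def mc_value(policy, s0, n_episodes=10_000, max_steps=100):
    total = 0.0
    for _ in range(n_episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(max_steps):
            if s == TERMINAL:
                break
            s, r = step(s, policy(s))
            ret += discount * r
            discount *= GAMMA
        total += ret
    return total / n_episodes

print(mc_value(pi_det, s0=0))    # V is a point output: one number per state
print(mc_value(pi_stoch, s0=0))  # lower, since this policy sometimes moves left
```

Note that the return of an individual episode differs from run to run even under `pi_det` (the environment slips), which is exactly why the definition of $V^{\pi}$ takes an expectation rather than reporting a single trajectory's return.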

A.D
  • Can you give me an example of an environment not being deterministic? As I see it, if the agent applies action $a$ to an environment in state $s$, it deterministically changes the environment to $s'$ – MiloMinderbinder Dec 01 '17 at 20:53
  • A classical example is a robot that moves one step to the left (action), but the surface is slippery (walking on ice), so it actually moves two steps left. In fact, such environments are the norm and are extensively studied. My example is actually a well-known "toy" environment: https://gym.openai.com/envs/FrozenLake-v0/ – A.D Dec 01 '17 at 20:59
  • So state $s$ and action $a$ upon it lead to a probability distribution over $s'$. I got that right? – MiloMinderbinder Dec 01 '17 at 21:04
  • Yes, just like $p(a\mid s)$ is stochastic, $p(s' \mid s, a)$ is also stochastic. – A.D Dec 01 '17 at 21:05
  • Just two more things: 1. $p(a\mid s)$ is stochastic only for a stochastic policy, right? 2. Can you confirm the other answer posted is wrong about what the expectation is taken over, so I can accept your answer? – MiloMinderbinder Dec 01 '17 at 21:11
  • 1. Yes. 2. The other answer is not "wrong" but maybe less clear... the decomposed steps $(s,a,r,s')$ are usually stored and used as "training examples" to learn a value function. The other answer is specific to an implementation (IMO) – A.D Dec 01 '17 at 21:44
  • Should the value function be deterministic given the state? You are taking the expectation, and V(s) itself is not a random variable, right? – Albert Chen Jun 20 '19 at 20:46
  • @A.D Just to clarify: you can say $s'$ or $a$ is random, but you can't say $p(a\mid s)$ is random; "stochastic" or "random" describes a variable. Actually, if the environment is stable, the probability density $p(a\mid s)$ itself doesn't change. – Albert Chen Jun 20 '19 at 20:49
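The slippery-surface point from the comments above is easy to see in code. Here is a tiny sketch with made-up "icy floor" dynamics (not the actual FrozenLake transition rule): applying the same action in the same state repeatedly yields different next states, i.e. the environment defines a distribution $p(s'\mid s,a)$ rather than a single successor.

```python
import random
from collections import Counter

def slippery_step(s, a):
    """Made-up icy-floor dynamics: the intended move (a = -1 or +1) succeeds
    with prob 0.7, overshoots by one extra step with prob 0.2, and fails
    (no movement) with prob 0.1."""
    u = random.random()
    if u < 0.7:
        return s + a
    elif u < 0.9:
        return s + 2 * a
    return s

# Same state, same action, many trials -> an empirical p(s' | s=5, a=-1).
counts = Counter(slippery_step(5, -1) for _ in range(10_000))
print({s_next: n / 10_000 for s_next, n in sorted(counts.items())})
# prints roughly {3: 0.2, 4: 0.7, 5: 0.1}
```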

The policy can be stochastic or deterministic. The expectation is over training examples given the conditions. The value function is an estimate of the return, which is why it's an expectation.

Neil G