Questions tagged [stochastic-policy]
10 questions
8
votes
2 answers
Is a policy always deterministic in reinforcement learning?
In reinforcement learning, is a policy always deterministic, or is it a probability distribution over actions (from which we sample)? If the policy is deterministic, why isn't the value function, which is defined at a given state for a given policy…

MiloMinderbinder
- 1,622
- 2
- 15
- 31
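The distinction this question asks about can be made concrete in a few lines. A minimal sketch (the softmax parameterization and the action values here are illustrative assumptions, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.4])  # action values for a single state

# Deterministic policy: the state maps to exactly one action.
deterministic_action = int(np.argmax(q_values))

# Stochastic policy: a probability distribution over actions (softmax here),
# from which an action is sampled.
probs = np.exp(q_values) / np.exp(q_values).sum()
stochastic_action = int(rng.choice(len(q_values), p=probs))
```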
3
votes
1 answer
Proof that any $\epsilon$-greedy policy is an improvement over any $\epsilon$-soft policy
In the book by Richard Sutton and Andrew Barto, "Reinforcement Learning: An Introduction", 2nd edition, on page 101 there is a proof, and I don't understand one passage of it.
We want to prove that any $\epsilon$-greedy policy with respect to an…

robertspierre
- 1,358
- 6
- 21
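For context, the passage readers most often ask about is the inequality in which the max over actions is bounded below by a particular weighted average of the action values. In the book's notation, with $\pi'$ the $\epsilon$-greedy policy and $\pi$ any $\epsilon$-soft policy:

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_a \pi'(a \mid s)\, q_\pi(s, a) \\
&= \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \epsilon) \max_a q_\pi(s, a) \\
&\ge \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \epsilon) \sum_a \frac{\pi(a \mid s) - \frac{\epsilon}{|\mathcal{A}(s)|}}{1 - \epsilon}\, q_\pi(s, a) \\
&= \sum_a \pi(a \mid s)\, q_\pi(s, a) = v_\pi(s),
\end{aligned}
$$

where the inequality holds because the weights $\bigl(\pi(a \mid s) - \epsilon/|\mathcal{A}(s)|\bigr)/(1-\epsilon)$ are nonnegative (since $\pi$ is $\epsilon$-soft) and sum to one, so no weighted average of the $q_\pi(s,a)$ can exceed their max.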
2
votes
1 answer
Policy improvement in SARSA and Q-learning
I have a rather basic question about SARSA and Q-learning. Looking at the pseudocode of the two algorithms in Sutton & Barto's book, I see that the policy improvement step is missing.
How will I get the optimal policy from the two algorithms? Are they used to find…

Jor_El
- 391
- 3
- 9
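One way to see where the improvement step hides: in those pseudocodes it is implicit in the $\epsilon$-greedy action selection itself, since each action is chosen greedily with respect to the current Q. A minimal sketch of that selection (the tabular Q indexed by state is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, state, epsilon=0.1):
    """Choosing actions (eps-)greedily w.r.t. the current Q is itself
    the implicit policy-improvement step in SARSA and Q-learning."""
    n_actions = Q[state].shape[0]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore
    return int(np.argmax(Q[state]))          # exploit: the greedy (improved) choice
```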
1
vote
1 answer
How do I measure how different two policies are?
I have two agents that both follow a baseline behavioral policy $\pi(a|s)$. If I then modify the state-action distribution for the two agents (resulting in two new policies), is there a standard measure I can use to tell how "far" the policies are from…

Dirk
- 111
- 2
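A standard choice for a single state is an f-divergence such as the KL divergence between the two action distributions, averaged over states in practice. A minimal sketch (the distributions below are hypothetical):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two action distributions at a single state."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical action distributions for the same state under two policies.
pi_a = [0.7, 0.2, 0.1]
pi_b = [0.5, 0.3, 0.2]
print(kl_divergence(pi_a, pi_b))
```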
1
vote
0 answers
Can ε-greedy with Q-learning / SARSA have a stochastic policy?
Hello, I'm now studying Q-learning and SARSA with ε-greedy and softmax strategies, and I have a question about my readings.
In my readings, SARSA with ε-greedy causes value-function oscillations in the case of stochastic policies, but I think…

BE LEO
- 11
- 1
1
vote
1 answer
Discrete and continuous actions in the same environment
I am working on an RL environment that requires both discrete and continuous actions as input from the agent. I currently have a fine implementation of DDPG, which I would like to use for the continuous part. But what about the discrete actions? Can…

franyx
- 11
- 1
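One common way to express such a hybrid action space, sketched here under the assumption of a Gymnasium-style environment, is a Dict space combining a discrete and a continuous component:

```python
from gymnasium import spaces

# A sketch of a hybrid action space: one discrete component plus one
# continuous (Box) component, sampled together.
action_space = spaces.Dict({
    "discrete": spaces.Discrete(3),
    "continuous": spaces.Box(low=-1.0, high=1.0, shape=(2,)),
})
print(action_space.sample())
```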
1
vote
0 answers
Why is the Monte Carlo control algorithm written this way?
I am having trouble understanding this algorithm, since this is not how I would have written it.
To me, we should first fix a policy. Then, we evaluate the Q values associated with this policy by doing exploration and reducing the…

Hugo Laurençon
- 51
- 4
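For reference, a minimal sketch of the on-policy Monte Carlo control loop the question refers to (every-visit, $\epsilon$-greedy; the old-style Gym interface is an assumption), showing that evaluation and improvement are interleaved episode by episode rather than run to convergence separately:

```python
import numpy as np
from collections import defaultdict

def mc_control(env, n_episodes, epsilon=0.1, gamma=1.0):
    """Sketch of on-policy every-visit Monte Carlo control (assumes an
    old-style Gym env with discrete observations and actions)."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))
    N = defaultdict(lambda: np.zeros(nA))
    rng = np.random.default_rng(0)

    for _ in range(n_episodes):
        # Generate one episode with the current eps-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            probs = np.full(nA, epsilon / nA)
            probs[int(np.argmax(Q[state]))] += 1.0 - epsilon
            action = int(rng.choice(nA, p=probs))
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Evaluation and improvement are interleaved: Q is updated from
        # this episode's returns, and the next episode already acts
        # eps-greedily w.r.t. the updated Q.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            N[state][action] += 1.0
            Q[state][action] += (G - Q[state][action]) / N[state][action]
    return Q
```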
0
votes
0 answers
What should be the policy for online reinforcement learning with intrinsic reward?
An agent receives an extrinsic reward $r_{ext}$ and an intrinsic reward $r_{int}$, and a Q-function approximation is trained using TD learning such that $Q(s,a)$ approximates the expected return of $r_{ext} + \beta r_{int}$, where $\beta$ is a…

Kevin
- 1
- 1
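The update described in the question can be sketched as a single TD step (the tabular Q and the hyperparameter values are assumptions for illustration):

```python
import numpy as np

def td_update(Q, s, a, r_ext, r_int, s_next, alpha=0.1, beta=0.5, gamma=0.99):
    """One Q-learning-style TD step on the combined reward r_ext + beta * r_int.
    Q is a 2-D array indexed by (state, action)."""
    target = (r_ext + beta * r_int) + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```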
0
votes
0 answers
How to prove that stochastic policy iteration converges?
I was reading Sutton's book Reinforcement Learning: An Introduction, especially the policy iteration part.
There was a proof of convergence of policy iteration for a deterministic policy.
So I tried to find the proof for the case of a stochastic policy,…

nawab
- 1
0
votes
1 answer
Are the two $\epsilon$-greedy policies different?
I found two different versions of the $\epsilon$-greedy policy for Monte Carlo and Q-learning:
For Monte Carlo:
$\pi(a|s) = \epsilon/m + 1 - \epsilon$ for the best action and $\pi(a|s) = \epsilon/m$ for the other actions
For Q-learning:
$\pi(a|s) = 1 - \epsilon$ to…

abcd
- 1
- 1
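A sketch of the two parameterizations side by side; the first matches the Monte Carlo form quoted above, while the second is an assumption about the truncated Q-learning form (the greedy action gets exactly $1-\epsilon$ and the remaining $\epsilon$ is split over the other $m-1$ actions):

```python
import numpy as np

def eps_greedy_mc(q, epsilon):
    """Monte Carlo version: every action gets eps/m, and the greedy action
    additionally gets 1 - eps, so its total is 1 - eps + eps/m."""
    m = len(q)
    probs = np.full(m, epsilon / m)
    probs[int(np.argmax(q))] += 1.0 - epsilon
    return probs

def eps_greedy_alt(q, epsilon):
    """Alternative version (an assumption, since the question text is cut
    off): the greedy action gets exactly 1 - eps, and eps is split over
    the other m - 1 actions."""
    m = len(q)
    probs = np.full(m, epsilon / (m - 1))
    probs[int(np.argmax(q))] = 1.0 - epsilon
    return probs

q = np.array([0.1, 0.5, 0.4])
print(eps_greedy_mc(q, 0.1))   # greedy action: 1 - 0.1 + 0.1/3 ≈ 0.933
print(eps_greedy_alt(q, 0.1))  # greedy action: exactly 0.9
```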