Questions tagged [stochastic-policy]

10 questions
8
votes
2 answers

Is a policy always deterministic in reinforcement learning?

In reinforcement learning, is a policy always deterministic, or is it a probability distribution over actions (from which we sample)? If the policy is deterministic, why isn't the value function, which is defined at a given state for a given policy…
3
votes
1 answer

Proof that any $\epsilon$-greedy policy is an improvement over any $\epsilon$-soft policy

In the book by Richard Sutton and Andrew Barto, "Reinforcement Learning: An Introduction", 2nd edition, there is a proof on page 101, and I don't understand one step of it. We want to prove that any $\epsilon$-greedy policy with respect to an…
2
votes
1 answer

Policy improvement in SARSA and Q-learning

I have a rather basic question about SARSA and Q-learning. Looking at the pseudocode of the two algorithms in the Sutton & Barto book, I see that the policy improvement step is missing. How do I get the optimal policy from the two algorithms? Are they used to find…
1
vote
1 answer

How do I measure how different two policies are?

I have two agents that both follow a baseline behavioral policy $\pi(a|s)$. If I then modify the state-action distribution for the two agents (resulting in two new policies), is there a standard measure I can use to tell how "far" the policies are from…
1
vote
0 answers

Can ε-greedy with Q-learning / SARSA have a stochastic policy?

Hello, I'm now studying Q-learning and SARSA with ε-greedy and softmax strategies, and I have a question about my readings. According to my readings, SARSA with ε-greedy causes value-function oscillations in the case of stochastic policies, but I think…
1
vote
1 answer

Discrete and continuous actions in the same environment

I am working on an RL environment that requires both discrete and continuous actions as input from the agent. I currently have a working implementation of DDPG which I would like to use for the continuous part. But what about the discrete actions? Can…
1
vote
0 answers

Why is the Monte Carlo Control algorithm written this way?

I am having trouble understanding this algorithm, since this is not how I would have written it. To me, we should first fix a policy. Then, we evaluate the Q values associated with this policy by doing exploration and reducing the…
0
votes
0 answers

What should the policy be for online reinforcement learning with intrinsic reward?

An agent receives an extrinsic reward $r_{ext}$ and an intrinsic reward $r_{int}$ and a Q-function approximation is trained using TD learning such that $Q(s,a)$ approximates the expected return of $r_{ext} + \beta r_{int}$ where $\beta$ is a…
Kevin
0
votes
0 answers

How to prove that stochastic policy iteration converges?

I was reading Sutton's book Reinforcement Learning: An Introduction, especially the policy iteration part. There was a proof of convergence of policy iteration with a deterministic policy. So I tried to find the proof for the case of a stochastic policy,…
nawab
0
votes
1 answer

Are the two $\epsilon$-greedy policies different?

I found 2 different versions of the $\epsilon$-greedy policy for Monte Carlo and Q-learning. For Monte Carlo: $\pi(a|s)=\epsilon/m+1-\epsilon$ to choose the best action and $\pi(a|s)=\epsilon/m$ for the other actions. For Q-learning: $\pi(a|s)=1-\epsilon$ to…
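
As a rough illustration of the Monte Carlo form quoted in this excerpt, the sketch below builds the action probabilities for one state. It assumes that $m$ is the number of actions and that ties for the best action go to the first maximum; the function names `epsilon_greedy_probs` and `sample_action` are hypothetical, not taken from the question. The Q-learning form is truncated in the excerpt, so it is not reproduced here.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s) per action: epsilon/m for every action,
    plus an extra 1 - epsilon for the greedy action.
    Assumption: m = len(q_values); ties go to the first maximum."""
    m = len(q_values)
    greedy = max(range(m), key=lambda a: q_values[a])
    probs = [epsilon / m] * m
    probs[greedy] += 1.0 - epsilon
    return probs


def sample_action(q_values, epsilon, rng=random):
    """Draw one action index from the distribution above."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


# Illustrative Q-values for a single state with four actions.
probs = epsilon_greedy_probs([0.1, 0.5, 0.2, 0.0], epsilon=0.2)
```

With these illustrative values, each non-greedy action gets probability $\epsilon/m$, the greedy action gets $\epsilon/m + 1 - \epsilon$, and the probabilities sum to 1, so the policy is stochastic whenever $\epsilon > 0$.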