Questions tagged [policy-gradient]

41 questions
5 votes, 1 answer

Reinforcement Learning - What is the logic behind actor-critic methods? Why use a critic?

Following David Silver's course, I came across the actor-critic family of policy improvement algorithms. For one-step Markov decision processes it holds that $$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s,a)\, r]$$ where $J$ is…
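As a rough illustration of that identity, here is a minimal score-function sketch for a softmax policy over discrete actions; the features, parameters, and one-step rewards below are made-up placeholders, not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, n_features = 3, 4
theta = np.zeros((n_features, n_actions))   # policy parameters
phi = rng.normal(size=n_features)           # features of the single state s
rewards = np.array([1.0, 0.0, 0.5])         # hypothetical one-step rewards r(s, a)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_pi(theta, phi, a):
    """Gradient of log softmax(phi @ theta)[a] with respect to theta."""
    pi = softmax(phi @ theta)
    g = -np.outer(phi, pi)       # -phi * pi_j term for every action j
    g[:, a] += phi               # +phi for the action actually taken
    return g

# Monte Carlo estimate of E_pi[ grad log pi(s, a) * r ]
grad_est = np.zeros_like(theta)
n_samples = 10_000
for _ in range(n_samples):
    pi = softmax(phi @ theta)
    a = rng.choice(n_actions, p=pi)
    grad_est += grad_log_pi(theta, phi, a) * rewards[a]
grad_est /= n_samples
print(grad_est)   # approximates the gradient of the expected one-step reward
```

An actor-critic method replaces the raw reward (or full return) in this estimator with a learned critic's estimate, which is where the variance reduction that motivates the critic comes from.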
4 votes, 0 answers

Why does the Policy Gradient Theorem generalize to continuous action spaces?

The policy gradient objective generally takes the following form: $$ L^{PG}(\theta) = \mathbb{E}_t \left[ \log \pi_\theta(a_t \mid s_t)\, A_t \right] $$ where $\pi_\theta(a_t \mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ and $A_t$ is an…
BadProgrammer
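One way to see why the same expression carries over from probability mass functions to densities is to check the score-function identity numerically for a Gaussian. A small sketch, where the function $f$ and the parameters are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma = 0.5, 1.0
f = lambda a: a ** 2 + 2.0 * a   # arbitrary "advantage-like" function of a continuous action

n = 200_000
a = rng.normal(mu, sigma, size=n)

# Score-function estimator: E[ d/dmu log N(a; mu, sigma^2) * f(a) ]
score = (a - mu) / sigma ** 2
grad_score = np.mean(score * f(a))

# Finite-difference check of d/dmu E[f(a)], using common random numbers
eps = 1e-3
noise = rng.normal(0.0, sigma, size=n)
grad_fd = (np.mean(f(mu + eps + noise)) - np.mean(f(mu - eps + noise))) / (2 * eps)

print(grad_score, grad_fd)   # both should be close to 2*mu + 2 = 3.0
```

Nothing in the derivation uses discreteness: replacing the sum over actions by an integral over the density gives the same $\mathbb{E}[\nabla_\theta \log \pi_\theta \cdot A]$ form.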
3 votes, 3 answers

Is it possible to use DDPG for discrete action space?

In the Deep Deterministic Policy Gradient (DDPG) method, we use two neural networks: one is the actor and the other is the critic. The actor network directly maps states to actions (the output of the network is the action itself) instead of outputting…
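The question is whether DDPG's deterministic, continuous actor output can be reconciled with a discrete action set. One commonly mentioned workaround is a Gumbel-softmax relaxation of the categorical choice; a minimal sketch of just the sampling step, with illustrative logits and temperature:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(logits, tau=1.0):
    """Differentiable relaxation of sampling from a categorical distribution.
    As tau -> 0 the sample approaches a one-hot vector."""
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                     # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max()
    e = np.exp(y)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])             # hypothetical actor output for 3 discrete actions
print(gumbel_softmax_sample(logits, tau=0.5))   # soft "action", usable where DDPG expects a continuous one
```

The simpler alternatives are of course to use a method designed for discrete actions (e.g. DQN) or an on-policy actor-critic with a categorical policy.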
3 votes, 1 answer

Why is there no Target Value function in PPO?

I just implemented the PPO algorithm in TensorFlow, strictly following the algorithm provided in the original PPO paper by Schulman et al., 2017. Previously I did some experiments with the DDPG algorithm by Lillicrap et al., 2016, in which they…
flxh
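For context, a sketch of how PPO-style implementations usually build the value target: it is recomputed from the returns of the freshly collected rollout rather than read from a slowly updated target network as in DDPG. The variable names below are assumptions, not the paper's pseudocode.

```python
import numpy as np

def discounted_returns(rewards, last_value, dones, gamma=0.99):
    """Bootstrapped discounted returns for one rollout; these serve as the
    regression targets for the value head, recomputed from fresh data each update."""
    returns = np.zeros(len(rewards))
    running = last_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns

rewards = np.array([1.0, 0.0, 0.0, 1.0])
dones = np.array([0.0, 0.0, 0.0, 1.0])
targets = discounted_returns(rewards, last_value=0.5, dones=dones)
values = np.array([0.9, 0.8, 0.7, 0.6])          # hypothetical value-head predictions
value_loss = np.mean((values - targets) ** 2)     # simple squared-error value loss
print(targets, value_loss)
```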
3 votes, 0 answers

REINFORCE: calculating the log policy gradient for a continuous action space

I've noticed that when modelling a continuous action space, the default approach is to estimate a mean and a variance, each parameterized by a neural network or some other model. I also often see that a single network $\theta$ models…
tryingtolearn
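A minimal sketch of that parameterization, assuming a single tiny linear "network" with a mean head and a state-independent log-standard-deviation (a common but not universal choice); sizes and initialization are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_actions = 4, 2
W_mu = rng.normal(scale=0.1, size=(n_features, n_actions))   # mean head
log_std = np.zeros(n_actions)                                 # state-independent log-std

def gaussian_log_prob(a, mu, log_std):
    """log N(a; mu, diag(exp(log_std)^2)), summed over action dimensions."""
    std = np.exp(log_std)
    return np.sum(-0.5 * ((a - mu) / std) ** 2 - log_std - 0.5 * np.log(2 * np.pi))

state = rng.normal(size=n_features)
mu = state @ W_mu
action = mu + np.exp(log_std) * rng.normal(size=n_actions)    # sample a ~ N(mu, std^2)
print(gaussian_log_prob(action, mu, log_std))  # the log pi(a|s) term that REINFORCE differentiates
```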
3 votes, 1 answer

Variance of reparameterization trick and score function

For an expectation $\mathbb{E}_{z\sim q_\phi(z|x)}[f(z)]$ (assuming $f$ is continuous), where $q_\phi$ is a Gaussian distribution, if we want to compute the gradient w.r.t. $\phi$, we have two ways to do that: compute the score function…
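A small numerical comparison of the two estimators for $\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[z^2]$, whose exact gradient w.r.t. $\mu$ is $2\mu$; the choice of $f$ here is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.0, 2.0
f = lambda z: z ** 2           # true gradient: d/dmu E[z^2] = d/dmu (mu^2 + sigma^2) = 2*mu
df = lambda z: 2 * z

n = 100_000
eps = rng.normal(size=n)
z = mu + sigma * eps

# Score-function (log-derivative / REINFORCE) estimator of d/dmu E[f(z)]
score_samples = f(z) * (z - mu) / sigma ** 2

# Reparameterization (pathwise) estimator: z = mu + sigma * eps, so dz/dmu = 1
reparam_samples = df(z)

for name, s in [("score", score_samples), ("reparam", reparam_samples)]:
    print(name, "mean:", s.mean(), "variance:", s.var())
# Both means are close to 2*mu = 2, but the pathwise estimator has far lower variance here.
```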
2 votes, 1 answer

Why the $\gamma^t$ is needed here in REINFORCE: Monte-Carlo Policy-Gradient Control (episodic) for $\pi_{*}$?

While re-reading the policy gradient methods in Prof. Sutton's RL book, I found there is a $\gamma^t$ factor in the last row of the pseudocode (as shown below). The book says: The second difference between the pseudocode update and the REINFORCE update equation (13.8) is that…
GoingMyWay
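For reference, a sketch of that episodic loop with the $\gamma^t$ factor written out explicitly; `theta`, `grad_log_pi`, and the episode below are placeholders, not Sutton's actual pseudocode:

```python
import numpy as np

def reinforce_update(theta, episode, grad_log_pi, alpha=0.01, gamma=0.99):
    """episode: list of (state, action, reward) tuples for one trajectory."""
    rewards = [r for (_, _, r) in episode]
    T = len(episode)
    for t, (s, a, _) in enumerate(episode):
        # Return from time t onward
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        # The gamma**t factor discounts updates made later in the episode, matching
        # the discounted objective J = E[ sum_t gamma^t R_{t+1} ].
        theta = theta + alpha * (gamma ** t) * G * grad_log_pi(theta, s, a)
    return theta

# toy usage with a dummy gradient function
theta0 = np.zeros(3)
dummy_grad = lambda th, s, a: np.ones(3)
episode = [(0, 1, 1.0), (1, 0, 0.0), (0, 1, 1.0)]
print(reinforce_update(theta0, episode, dummy_grad))
```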
2 votes, 1 answer

Reinforcement Learning with Oracle Policy

I'm working on a reinforcement learning problem. The simulation environment is pretty simple (like those maze problems) so I can manually work out its solution. The idea I have is: since I can work out the optimal policy of the environment, is it…
DiveIntoML
2 votes, 3 answers

Why does the approximation of $\log \pi_{\theta}(a|s)$ improve numerical stability?

In Maxim Lapan's book Deep Reinforcement Learning Hands-On, in the section on continuous A2C, it says: By definition, the probability density function of the Gaussian distribution is $$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}}…
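The practical point is that evaluating the density first and then taking the log can underflow, while expanding $\log f(x \mid \mu, \sigma^2)$ analytically stays finite. A small demonstration with arbitrary numbers:

```python
import numpy as np

mu, sigma = 0.0, 1e-3     # a very confident (small-variance) Gaussian policy head
x = 0.05                  # an action many standard deviations from the mean

# Naive route: evaluate the density, then take the log -> underflows
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
print(pdf, np.log(pdf))   # pdf is 0.0 in float64, so log(pdf) is -inf (with a warning)

# Stable route: expand log f(x | mu, sigma^2) analytically and never exponentiate
log_pdf = -(x - mu) ** 2 / (2 * sigma ** 2) - 0.5 * np.log(2 * np.pi * sigma ** 2)
print(log_pdf)            # a finite (large negative) number, usable in the loss
```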
2 votes, 1 answer

Reinforcement learning using the gradient of expected value doesn't lead to the optimal policy

I'm trying to learn more about reinforcement learning, and I've devised a very simple game as a thought experiment. The game consists of a single turn in which the agent plays one of three possible cards. The first card, $c_0$, has a payoff of 1, the…
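For a setup like this, the exact gradient of the expected payoff under a softmax policy has a simple closed form. A toy sketch with placeholder payoffs (the question's own payoffs are not reproduced here):

```python
import numpy as np

# Single-turn game: three cards, softmax policy over logits.
payoffs = np.array([1.0, 2.0, 3.0])      # illustrative placeholder payoffs
theta = np.zeros(3)                      # policy logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Expected value J(theta) = sum_i pi_i * r_i; its exact gradient w.r.t. the logits is
# dJ/dtheta_i = pi_i * (r_i - J), i.e. the policy-gradient form with J as a baseline.
for step in range(2000):
    pi = softmax(theta)
    J = pi @ payoffs
    theta += 0.1 * pi * (payoffs - J)

print(softmax(theta))   # probability mass concentrates on the highest-payoff card
```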
2 votes, 0 answers

Can't use replay memory with policy gradient, why?

One of the approaches to improving the stability of the Policy Gradient family of methods is to use multiple environments in parallel. The reason behind this is the fundamental problem we discussed in Chapter 6, Deep Q-Network, when we talked…
jgauth
2 votes, 0 answers

Train model on "bootstrapped" target?

I'd like to train a model in scikit-learn with the following input. Instead of having (X, y), I have (X, dy), where dy is the amount by which y ought to shift upon an update. What I'm thinking is that I could define my target y…
2 votes, 1 answer

Can Q-learning or SARSA be used to find a stochastic policy?

If the optimal policy is known to be stochastic (e.g. as in the rock-paper-scissors game), can this stochastic policy be found using SARSA or Q-learning, or is it only possible with policy gradient approaches?
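A small sketch of why a greedy policy derived from Q-values struggles here: any deterministic choice in rock-paper-scissors is exploitable, while the uniform stochastic policy guarantees the game's value of 0 (standard payoff matrix assumed):

```python
import numpy as np

# Rock-paper-scissors payoff for the row player: entry [i, j] is the reward
# of playing i against j (0 = rock, 1 = paper, 2 = scissors).
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

# A value-based method ultimately acts greedily w.r.t. Q(s, a), which yields a
# deterministic policy. Any deterministic choice is exploitable by the opponent:
for a in range(3):
    print("deterministic", a, "worst-case value:", payoff[a].min())   # always -1

# The uniformly random (stochastic) policy guarantees value 0 against any opponent:
pi = np.ones(3) / 3
worst_case = min(pi @ payoff[:, j] for j in range(3))
print("uniform policy worst-case value:", worst_case)                 # 0.0
```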
1 vote, 0 answers

Reinforcement learning: is a softmax policy actor-critic expected to work on mountain car?

I am following David Silver's RL course and I'm struggling to apply the Actor Critic concept to the Mountain Car environment. I am using a softmax policy with linear function approximation. I am also estimating the action value function with linear…
1 vote, 0 answers

Notation in Trust Region Policy Optimization by John Schulman et al

I am quite new to the area of reinforcement learning and find it hard to convince myself that the different notations used for the reward function, the state/action value functions, etc. coincide. Apparently I am not the only one, and many people hope for an…