Questions tagged [policy-gradient]

41 questions
5 votes, 1 answer

Reinforcement Learning - What is the logic behind actor-critic methods? Why use a critic?

Following David Silver's course, I came across the actor-critic family of policy improvement algorithms. For one-step Markov decision processes it holds that $$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s,a)\, r]$$ where $J$ is…
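As a rough illustration of that identity, here is a minimal score-function sketch for a softmax policy over discrete actions; the features, parameters, and one-step rewards below are made-up placeholders, not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, n_features = 3, 4
theta = np.zeros((n_features, n_actions))   # policy parameters
phi = rng.normal(size=n_features)           # features of the single state s
rewards = np.array([1.0, 0.0, 0.5])         # hypothetical one-step rewards r(s, a)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_pi(theta, phi, a):
    """Gradient of log softmax(phi @ theta)[a] with respect to theta."""
    pi = softmax(phi @ theta)
    g = -np.outer(phi, pi)       # -phi * pi_j term for every action j
    g[:, a] += phi               # +phi for the action actually taken
    return g

# Monte Carlo estimate of E_pi[ grad log pi(s, a) * r ]
grad_est = np.zeros_like(theta)
n_samples = 10_000
for _ in range(n_samples):
    pi = softmax(phi @ theta)
    a = rng.choice(n_actions, p=pi)
    grad_est += grad_log_pi(theta, phi, a) * rewards[a]
grad_est /= n_samples
print(grad_est)   # approximates the gradient of the expected one-step reward
```

An actor-critic method replaces the raw reward (or full return) in this estimator with a learned critic's estimate, which is where the variance reduction that motivates the critic comes from.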
4 votes, 0 answers

Why does the Policy Gradient Theorem generalize to continuous action spaces?

The policy gradient objective generally takes the following form: $$ L^{PG}(\theta) = \mathbb{E}_t \left[ \log \pi_\theta(a_t \mid s_t)\, A_t \right] $$ where $\pi_\theta(a_t \mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ and $A_t$ is an…
BadProgrammer
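One way to see why the same expression carries over from probability mass functions to densities is to check the score-function identity numerically for a Gaussian. A small sketch, where the function $f$ and the parameters are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma = 0.5, 1.0
f = lambda a: a ** 2 + 2.0 * a   # arbitrary "advantage-like" function of a continuous action

n = 200_000
a = rng.normal(mu, sigma, size=n)

# Score-function estimator: E[ d/dmu log N(a; mu, sigma^2) * f(a) ]
score = (a - mu) / sigma ** 2
grad_score = np.mean(score * f(a))

# Finite-difference check of d/dmu E[f(a)], using common random numbers
eps = 1e-3
noise = rng.normal(0.0, sigma, size=n)
grad_fd = (np.mean(f(mu + eps + noise)) - np.mean(f(mu - eps + noise))) / (2 * eps)

print(grad_score, grad_fd)   # both should be close to 2*mu + 2 = 3.0
```

Nothing in the derivation uses discreteness: replacing the sum over actions by an integral over the density gives the same $\mathbb{E}[\nabla_\theta \log \pi_\theta \cdot A]$ form.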
3 votes, 3 answers

Is it possible to use DDPG for discrete action space?

In the Deep Deterministic Policy Gradient (DDPG) method, we use two neural networks: one is the actor and the other is the critic. The actor network directly maps states to actions (the output of the network is the action itself) instead of outputting…
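The question is whether DDPG's deterministic, continuous actor output can be reconciled with a discrete action set. One commonly mentioned workaround is a Gumbel-softmax relaxation of the categorical choice; a minimal sketch of just the sampling step, with illustrative logits and temperature:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(logits, tau=1.0):
    """Differentiable relaxation of sampling from a categorical distribution.
    As tau -> 0 the sample approaches a one-hot vector."""
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                     # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max()
    e = np.exp(y)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])             # hypothetical actor output for 3 discrete actions
print(gumbel_softmax_sample(logits, tau=0.5))   # soft "action", usable where DDPG expects a continuous one
```

The simpler alternatives are of course to use a method designed for discrete actions (e.g. DQN) or an on-policy actor-critic with a categorical policy.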
3 votes, 1 answer

Why is there no Target Value function in PPO?

I just implemented the PPO algorithm in TensorFlow, strictly following the algorithm provided in the original PPO paper by Schulman et al., 2017. Previously I did some experiments with the DDPG algorithm by Lillicrap et al., 2016, in which they…
flxh
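For context, a sketch of how PPO-style implementations usually build the value target: it is recomputed from the returns of the freshly collected rollout rather than read from a slowly updated target network as in DDPG. The variable names below are assumptions, not the paper's pseudocode.

```python
import numpy as np

def discounted_returns(rewards, last_value, dones, gamma=0.99):
    """Bootstrapped discounted returns for one rollout; these serve as the
    regression targets for the value head, recomputed from fresh data each update."""
    returns = np.zeros(len(rewards))
    running = last_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns

rewards = np.array([1.0, 0.0, 0.0, 1.0])
dones = np.array([0.0, 0.0, 0.0, 1.0])
targets = discounted_returns(rewards, last_value=0.5, dones=dones)
values = np.array([0.9, 0.8, 0.7, 0.6])          # hypothetical value-head predictions
value_loss = np.mean((values - targets) ** 2)     # simple squared-error value loss
print(targets, value_loss)
```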
3 votes, 0 answers

REINFORCE: calculating the log policy gradient for a continuous action space

I've noticed that when modelling a continuous action space, the default approach is to estimate a mean and a variance, each parameterized by a neural network or some other model. I also often see that a single network $\theta$ models…
tryingtolearn
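A minimal sketch of that parameterization, assuming a single tiny linear "network" with a mean head and a state-independent log-standard-deviation (a common but not universal choice); sizes and initialization are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_actions = 4, 2
W_mu = rng.normal(scale=0.1, size=(n_features, n_actions))   # mean head
log_std = np.zeros(n_actions)                                 # state-independent log-std

def gaussian_log_prob(a, mu, log_std):
    """log N(a; mu, diag(exp(log_std)^2)), summed over action dimensions."""
    std = np.exp(log_std)
    return np.sum(-0.5 * ((a - mu) / std) ** 2 - log_std - 0.5 * np.log(2 * np.pi))

state = rng.normal(size=n_features)
mu = state @ W_mu
action = mu + np.exp(log_std) * rng.normal(size=n_actions)    # sample a ~ N(mu, std^2)
print(gaussian_log_prob(action, mu, log_std))  # the log pi(a|s) term that REINFORCE differentiates
```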
3 votes, 1 answer

Variance of reparameterization trick and score function

For an expectation $\mathbb{E}_{z\sim q_\phi(z|x)}[f(z)]$ (assuming $f$ is continuous), where $q_\phi$ is a Gaussian distribution, if we want to compute the gradient w.r.t. $\phi$, we have two ways to do that: compute the score function…
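A small numerical comparison of the two estimators for $\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[z^2]$, whose exact gradient w.r.t. $\mu$ is $2\mu$; the choice of $f$ here is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.0, 2.0
f = lambda z: z ** 2           # true gradient: d/dmu E[z^2] = d/dmu (mu^2 + sigma^2) = 2*mu
df = lambda z: 2 * z

n = 100_000
eps = rng.normal(size=n)
z = mu + sigma * eps

# Score-function (log-derivative / REINFORCE) estimator of d/dmu E[f(z)]
score_samples = f(z) * (z - mu) / sigma ** 2

# Reparameterization (pathwise) estimator: z = mu + sigma * eps, so dz/dmu = 1
reparam_samples = df(z)

for name, s in [("score", score_samples), ("reparam", reparam_samples)]:
    print(name, "mean:", s.mean(), "variance:", s.var())
# Both means are close to 2*mu = 2, but the pathwise estimator has far lower variance here.
```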
2 votes, 1 answer

Why the $\gamma^t$ is needed here in REINFORCE: Monte-Carlo Policy-Gradient Control (episodic) for $\pi_{*}$?

While re-reading the policy gradient methods in Prof. Sutton's RL book, I found there is a $\gamma^t$ factor in the last row of the pseudocode (as shown below). The book says: The second difference between the pseudocode update and the REINFORCE update equation (13.8) is that…
GoingMyWay
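For reference, a sketch of that episodic loop with the $\gamma^t$ factor written out explicitly; `theta`, `grad_log_pi`, and the episode below are placeholders, not Sutton's actual pseudocode:

```python
import numpy as np

def reinforce_update(theta, episode, grad_log_pi, alpha=0.01, gamma=0.99):
    """episode: list of (state, action, reward) tuples for one trajectory."""
    rewards = [r for (_, _, r) in episode]
    T = len(episode)
    for t, (s, a, _) in enumerate(episode):
        # Return from time t onward
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        # The gamma**t factor discounts updates made later in the episode, matching
        # the discounted objective J = E[ sum_t gamma^t R_{t+1} ].
        theta = theta + alpha * (gamma ** t) * G * grad_log_pi(theta, s, a)
    return theta

# toy usage with a dummy gradient function
theta0 = np.zeros(3)
dummy_grad = lambda th, s, a: np.ones(3)
episode = [(0, 1, 1.0), (1, 0, 0.0), (0, 1, 1.0)]
print(reinforce_update(theta0, episode, dummy_grad))
```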
2 votes, 1 answer

Reinforcement Learning with Oracle Policy

I'm working on a reinforcement learning problem. The simulation environment is pretty simple (like those maze problems) so I can manually work out its solution. The idea I have is: since I can work out the optimal policy of the environment, is it…
DiveIntoML
2 votes, 3 answers

Why does the approximation of $\log \pi_{\theta}(a|s)$ improve numerical stability?

In Maxim Lapan's book Deep Reinforcement Learning Hands-On, in the section on continuous A2C, it says: By definition, the probability density function of the Gaussian distribution is $$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}}…
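The practical point is that evaluating the density first and then taking the log can underflow, while expanding $\log f(x \mid \mu, \sigma^2)$ analytically stays finite. A small demonstration with arbitrary numbers:

```python
import numpy as np

mu, sigma = 0.0, 1e-3     # a very confident (small-variance) Gaussian policy head
x = 0.05                  # an action many standard deviations from the mean

# Naive route: evaluate the density, then take the log -> underflows
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
print(pdf, np.log(pdf))   # pdf is 0.0 in float64, so log(pdf) is -inf (with a warning)

# Stable route: expand log f(x | mu, sigma^2) analytically and never exponentiate
log_pdf = -(x - mu) ** 2 / (2 * sigma ** 2) - 0.5 * np.log(2 * np.pi * sigma ** 2)
print(log_pdf)            # a finite (large negative) number, usable in the loss
```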
2 votes, 1 answer

Reinforcement learning using the gradient of expected value doesn't lead to the optimal policy

I'm trying to learn more about reinforcement learning, and I've devised a very simple game as a thought experiment. The game consists of a single turn in which the agent plays one of three possible cards. The first card, $c_0$, has a payoff of 1, the…
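For a setup like this, the exact gradient of the expected payoff under a softmax policy has a simple closed form. A toy sketch with placeholder payoffs (the question's own payoffs are not reproduced here):

```python
import numpy as np

# Single-turn game: three cards, softmax policy over logits.
payoffs = np.array([1.0, 2.0, 3.0])      # illustrative placeholder payoffs
theta = np.zeros(3)                      # policy logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Expected value J(theta) = sum_i pi_i * r_i; its exact gradient w.r.t. the logits is
# dJ/dtheta_i = pi_i * (r_i - J), i.e. the policy-gradient form with J as a baseline.
for step in range(2000):
    pi = softmax(theta)
    J = pi @ payoffs
    theta += 0.1 * pi * (payoffs - J)

print(softmax(theta))   # probability mass concentrates on the highest-payoff card
```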
2 votes, 0 answers

Can't use replay memory with policy gradient, why?

One of the approaches to improving the stability of the Policy Gradient family of methods is to use multiple environments in parallel. The reason behind this is the fundamental problem we discussed in Chapter 6, Deep Q-Network, when we talked…
jgauth
2 votes, 0 answers

Train model on "bootstrapped" target?

I'd like to train a model in scikit-learn with the following input. Instead of having (X, y), I have (X, dy), where dy is the amount by which y ought to shift upon an update. What I'm thinking is that I could define my target y…
2 votes, 1 answer

Can Q-learning or SARSA be used to find a stochastic policy?

If the optimal policy is known to be stochastic (e.g. as in the rock-paper-scissors game), can this stochastic policy be found using SARSA or Q-learning, or is it only possible with policy gradient approaches?
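A small sketch of why a greedy policy derived from Q-values struggles here: any deterministic choice in rock-paper-scissors is exploitable, while the uniform stochastic policy guarantees the game's value of 0 (standard payoff matrix assumed):

```python
import numpy as np

# Rock-paper-scissors payoff for the row player: entry [i, j] is the reward
# of playing i against j (0 = rock, 1 = paper, 2 = scissors).
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

# A value-based method ultimately acts greedily w.r.t. Q(s, a), which yields a
# deterministic policy. Any deterministic choice is exploitable by the opponent:
for a in range(3):
    print("deterministic", a, "worst-case value:", payoff[a].min())   # always -1

# The uniformly random (stochastic) policy guarantees value 0 against any opponent:
pi = np.ones(3) / 3
worst_case = min(pi @ payoff[:, j] for j in range(3))
print("uniform policy worst-case value:", worst_case)                 # 0.0
```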
1 vote, 0 answers

Reinforcement learning: is a softmax policy actor-critic expected to work on mountain car?

I am following David Silver's RL course and I'm struggling to apply the Actor Critic concept to the Mountain Car environment. I am using a softmax policy with linear function approximation. I am also estimating the action value function with linear…
1 vote, 0 answers

Notation in Trust Region Policy Optimization by John Schulman et al

I am quite new to the area of reinforcement learning and find it hard to convince myself that the different notations used for the reward function, the state/action value functions, etc. coincide. Apparently I am not the only one, and many people hope for an…