I just implemented the PPO algorithm in TensorFlow, strictly following the algorithm given in the original PPO paper by Schulman et al., 2017.
Previously I did some experiments with the DDPG algorithm by Lillicrap et al., 2016, in which they employ a target Q-function in order to stabilize training. However, in the PPO paper they do not seem to use a target value function.
Why is no target value function needed in the PPO algorithm? And would there be any benefit to using a target value function with soft updates in PPO?
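To make it clear what I mean by "soft updates", here is a minimal sketch of how I would add a Polyak-averaged target value network on top of my PPO value function; the network architecture, observation size, and tau value are just placeholders for illustration:

```python
import tensorflow as tf

def make_value_net(obs_dim):
    # Simple MLP state-value network V(s).
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(obs_dim,)),
        tf.keras.layers.Dense(64, activation="tanh"),
        tf.keras.layers.Dense(64, activation="tanh"),
        tf.keras.layers.Dense(1),
    ])

obs_dim = 8  # placeholder observation dimension
value_net = make_value_net(obs_dim)
target_value_net = make_value_net(obs_dim)
target_value_net.set_weights(value_net.get_weights())  # start in sync

tau = 0.005  # soft-update coefficient, as in DDPG

def soft_update(online, target, tau):
    # Polyak averaging: target <- tau * online + (1 - tau) * target
    for w_online, w_target in zip(online.variables, target.variables):
        w_target.assign(tau * w_online + (1.0 - tau) * w_target)

# After each gradient step on value_net, call:
#   soft_update(value_net, target_value_net, tau)
# and use target_value_net when bootstrapping the value targets.
```

In DDPG this kind of slowly moving target stabilizes the bootstrapped Q-targets, so I am wondering whether the same trick would help (or is simply unnecessary) for the value function used in PPO's advantage estimation.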