
We know that Deep Deterministic Policy Gradient (henceforth DDPG) is characterized by two kinds of neural networks: one for the critic $Q$ and one for the actor $\mu$, with parameters $\theta^Q$ and $\theta^\mu$ respectively. For stability reasons, Lillicrap et al. introduced two additional neural networks, the target critic and the target actor, with weights ${\theta^Q}^{'}$ and ${\theta^\mu}^{'}$ respectively.

Following the DDPG protocol, at each timestep the weights of the target networks are slowly updated, i.e.:

${\theta^\mu}^{'} \leftarrow \tau {\theta^\mu} + (1-\tau){\theta^\mu}^{'}$

${\theta^Q}^{'} \leftarrow \tau {\theta^Q} + (1-\tau){\theta^Q}^{'}$

slowly, because $\tau \ll 1$.
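
For concreteness, here is a minimal sketch of that soft update in PyTorch (the names `actor`, `actor_target`, `critic`, `critic_target` are my own placeholders, not from [1]):

```python
import torch

def soft_update(target_net: torch.nn.Module, source_net: torch.nn.Module, tau: float = 1e-3):
    """Polyak-average the source parameters into the target parameters."""
    with torch.no_grad():
        for target_param, source_param in zip(target_net.parameters(), source_net.parameters()):
            # theta' <- tau * theta + (1 - tau) * theta'
            target_param.mul_(1.0 - tau).add_(tau * source_param)

# Called once per training step, e.g.:
# soft_update(actor_target, actor, tau=0.001)
# soft_update(critic_target, critic, tau=0.001)
```
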

My question is: when training is over, that is, the policy achieves a satisfactory reward (and any other stopping conditions are met), which network should I use to test what I have just learned: the actor or the target actor?


1 Answer


The actor network should be used for the test/validation phase, since the actor weights are the ones directly optimized during training (via the policy gradient through the critic), whereas the target actor is only a slowly tracking copy used to stabilize the critic's learning targets (see [1]). It is also true that by the end of training there is usually not a huge difference between the actor and the target actor, precisely because of the slow update.
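
For illustration, a sketch of how the trained actor could be used at evaluation time, assuming a Gymnasium-style environment (names like `actor` and `env` are placeholders); note that no exploration noise is added:

```python
import torch

@torch.no_grad()
def evaluate(actor: torch.nn.Module, env, episodes: int = 10) -> float:
    """Run the deterministic policy (the actor, not the target actor) without exploration noise."""
    actor.eval()
    total_return = 0.0
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_return += reward
    return total_return / episodes
```
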

[1] Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
