We know that Deep Deterministic Policy Gradient (henceforth DDPG) is characterized by two kinds of neural networks: one for the critic $Q$ and one for the actor $\mu$, with parameters $\theta^Q$ and $\theta^\mu$ respectively. For stability reasons, Lillicrap et al. introduced two additional networks, the critic target and the actor target, with weights ${\theta^Q}^{'}$ and ${\theta^\mu}^{'}$ respectively.
Following the DDPG protocol, at each timestep the weights of the target networks are slowly updated, i.e.:
${\theta^\mu}^{'} \leftarrow \tau {\theta^\mu} + (1-\tau){\theta^\mu}^{'}$
${\theta^Q}^{'} \leftarrow \tau {\theta^Q} + (1-\tau){\theta^Q}^{'}$
The update is slow because $\tau \ll 1$.
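To make the soft update concrete, this is roughly how it looks in code (a minimal PyTorch sketch; the network shapes, the `soft_update` helper, and $\tau = 0.005$ are just illustrative choices of mine, not part of the original algorithm description):

```python
import copy
import torch
import torch.nn as nn

# Hypothetical minimal actor/critic modules, only for illustration.
actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))

# Target networks start as exact copies of the online networks.
target_actor = copy.deepcopy(actor)
target_critic = copy.deepcopy(critic)

def soft_update(target: nn.Module, online: nn.Module, tau: float = 0.005) -> None:
    """theta' <- tau * theta + (1 - tau) * theta', applied parameter-wise."""
    with torch.no_grad():
        for p_target, p_online in zip(target.parameters(), online.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)

# Called once per training step, after the actor/critic gradient updates.
soft_update(target_actor, actor, tau=0.005)
soft_update(target_critic, critic, tau=0.005)
```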
My question is: once training is over (that is, the policy reaches the desired reward and satisfies any other stopping conditions, if any), which network should I use at test time to evaluate what I have just trained: the actor or the target actor?