
The policy gradient objective generally takes the following form:

$$ L^{PG}(\theta) = \mathbb{E}_t \left[ \log \pi_\theta(a_t \mid s_t) A_t \right] $$

where $\pi_\theta(a_t \mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ and $A_t$ is an advantage estimator.

This makes perfect sense to me in discrete action spaces. However, I'm unsure why it still makes sense in continuous action spaces. In every application of policy gradients to continuous action spaces that I have seen, $\pi_\theta(a_t \mid s_t)$ is evaluated as a point on the PDF rather than as an actual probability. Why is this possible?
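
For concreteness, here is a minimal sketch of the pattern the question describes; it is my own illustration, with PyTorch's `Categorical` and `Normal` distributions and random stand-in network outputs as assumed placeholders, not something taken from any particular implementation. In the continuous case, `log_prob` returns the log of the density evaluated at the sampled action, which is exactly the point at issue:

```python
import torch
from torch.distributions import Categorical, Normal

advantage = torch.tensor(1.7)  # placeholder for the advantage estimate A_t

# Discrete case: log_prob is the log of an actual probability of the sampled action.
logits = torch.randn(4, requires_grad=True)   # stand-in for a policy network output
disc_pi = Categorical(logits=logits)
a_disc = disc_pi.sample()
loss_disc = -disc_pi.log_prob(a_disc) * advantage

# Continuous case: log_prob is the log of the density evaluated at a_cont,
# not the log of a probability, yet the same surrogate loss is formed.
mean = torch.zeros(1, requires_grad=True)     # stand-in for a policy network output
std = torch.ones(1)
cont_pi = Normal(mean, std)
a_cont = cont_pi.sample()
loss_cont = -cont_pi.log_prob(a_cont) * advantage

# Both terms are differentiated with respect to the policy parameters.
(loss_disc + loss_cont.sum()).backward()
```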

BadProgrammer
  • It involves making the transition from probabilities to probability densities, but within the framework of the algorithm that's merely a change of terminology. In the discrete case $\sum_a \pi(a|s) = 1$ must hold, while in the continuous case it is $\int \pi(a|s)\,da = 1$. In the end one can think of the transition as an infinitesimally small discretization, where probabilities are given on intervals $[a, a+da]$, i.e. $\pi(a_t \in [a, a+da] \mid s)$. – a_guest Feb 08 '19 at 15:22
  • but when we evaluate $\pi(a_t \mid s_t)$ we are given the point on the pdf, not the infinitesimally small discrete region $\pi(a_t \in [a, a+da] \mid s)$ – BadProgrammer Feb 09 '19 at 07:23
  • 1
    @nbro, you seem to have created the `[policy-gradient]` tag. Please consder creating an excerpt for it. – gung - Reinstate Monica Feb 15 '19 at 03:14
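
Spelling out the discretization argument from the comments above as a short worked step: for a small interval of width $da$,

$$ \log \pi_\theta(a_t \in [a, a+da] \mid s_t) \approx \log \big( \pi_\theta(a_t \mid s_t)\, da \big) = \log \pi_\theta(a_t \mid s_t) + \log da, $$

and since $\log da$ does not depend on $\theta$, it vanishes under $\nabla_\theta$, so the gradient of the log-density equals the gradient of the log-probability of the small interval.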

0 Answers