
In Maxim Lapan's book Deep Reinforcement Learning Hands-on, section Continuous A2C, it says

By definition, the probability density function of the Gaussian Distribution is $$f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}$$ We could directly use this formula to get the probabilities, but to improve numerical stability, it is worth doing some math and simplifying the expression for $\log \pi_{\theta}(a|s)$.

The final result will be this: $\log \pi_{\theta}(a|s) = -\frac{(x-\mu)^2}{2 \sigma^2} - \log \sqrt{2 \pi \sigma^2}$

Can you explain to me how to get the approximation of $\log \pi_{\theta}(a|s)$? How does it improve the numerical stability?

jgauth
  • You are missing a negative sign in the exponential in the initial density function. Taking logarithms gives you the result (which is exact, not an approximation). – Ben May 24 '20 at 23:31

3 Answers


If you take the log of $f(x\vert \mu, \sigma^2)$, you get the expression you want. I'm not sure why the author changes notation from $f$ to $\pi_\theta$, but perhaps the notation is explained earlier in the book.
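Spelled out, the step uses only $\log(ab) = \log a + \log b$ and $\log e^{y} = y$:

$$\log f(x \vert \mu, \sigma^2) = \log\left(\frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(x-\mu)^2}{2 \sigma^2}}\right) = -\frac{(x-\mu)^2}{2 \sigma^2} - \log \sqrt{2 \pi \sigma^2},$$

which matches the expression quoted from the book; as noted in the comments, the result is exact, not an approximation.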

As for numerical stability, you're likely going to be talking about likelihood functions sooner or later. The likelihood is the product of density values evaluated at the data. It looks like

$$ \prod_i f(x_i \vert \mu, \sigma^2)$$

If many of the $f(x_i\vert \mu , \sigma^2)$ are smaller than one, we risk underflowing when computing this product. Working with the log-likelihood combats this problem while retaining all the information about the location of the optima of the likelihood.
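A quick numerical illustration (a minimal sketch using NumPy and SciPy; the sample of 1,000 standard-normal draws is hypothetical):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: 1,000 draws from a standard normal.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

# Multiplying 1,000 density values (each well below 1) underflows float64.
likelihood = np.prod(norm.pdf(x))
print(likelihood)        # 0.0 -- the product has underflowed

# Summing the log-densities keeps the same information, with no underflow.
log_likelihood = np.sum(norm.logpdf(x))
print(log_likelihood)    # roughly -1.4e3, an ordinary floating-point number
```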

Demetri Pananos

Look at the standard normal PDF and the range of values it takes over $x \in [0, 9]$:

[Figure: the standard normal PDF and its logarithm plotted over $x \in [0, 9]$]

Over this range the Gaussian PDF values drop from about 0.4 down to 1e-18, while their logarithm only varies from about -0.9 to -41. Numerical algorithms are less stable when the values span that many orders of magnitude, especially when the sensitivity differs so much across values of $x$. The logarithm of the PDF, by contrast, varies smoothly over a modest range.

This is especially important when you take products of PDF values, such as in likelihood expressions.
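You can reproduce the numbers above directly (a minimal sketch using scipy.stats.norm; the coarse 10-point grid is just for illustration):

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(0.0, 9.0, 10)

pdf_vals = norm.pdf(x)      # falls from about 0.4 down to about 1e-18
log_vals = norm.logpdf(x)   # falls from about -0.9 down to about -41

for xi, p, lp in zip(x, pdf_vals, log_vals):
    print(f"x = {xi:3.1f}   pdf = {p:10.3e}   log pdf = {lp:7.2f}")
```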

Aksakal

This computational principle applies in a wide range of probability problems involving density and mass functions. The reason for this advice is that probability density values can be very small positive numbers, and when you try to compute them directly from the density formula you can run into arithmetic underflow. This occurs when numbers involved in the computation are smaller than the smallest positive number the machine can represent, so they get rounded down to zero, which makes any formula involving those very small positive values inaccurate.

By transferring over to "log-space" you convert small probabilities into negative numbers of moderate magnitude, and this largely eliminates the problem of arithmetic underflow. For this reason, algorithms used to compute density and mass functions almost always conduct their computations in log-space and then exponentiate to get back to regular probability space at the end of the computation. (For details on computing in log-space see related questions here and here.)
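For instance, with SciPy's normal distribution (a minimal sketch; the tail point $x = 40$ is arbitrary), evaluating the log-density directly avoids the underflow you hit when you compute the density first and then take its log:

```python
import numpy as np
from scipy.stats import norm

x = 40.0  # a point far in the tail of the standard normal

# Density-first: norm.pdf(40.0) underflows to 0.0, so its log is -inf.
naive = np.log(norm.pdf(x))
print(naive)              # -inf (NumPy also emits a divide-by-zero warning)

# Log-space: the log-density is computed directly and stays finite.
direct = norm.logpdf(x)
print(direct)             # about -800.92
```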

Ben