7

I need to evaluate the following integral:

$$\int_{-\infty}^\infty\mathrm d x \exp\left(-\frac{(x-\mu)^2}{2\nu}\right) \ln(1+e^x)$$

where $\mu$ is a finite real number and $\nu > 0$. Up to the normalizing constant $\sqrt{2\pi\nu}$, this is just the average value of $\ln(1+e^x)$ when $x$ is normally distributed with mean $\mu$ and variance $\nu$.

I think this might not have a closed-form solution (but see https://math.stackexchange.com/q/3423097/10063). Anticipating that a closed-form solution might not exist, are there any suggestions on how I can evaluate this numerically?

becko
  • Using the Taylor expansion of $\ln(1+x)$ around $x=0$ I get the series representation $\sum_{n=1}^{\infty} (-1)^{n+1} \frac{1}{n} \exp(n\mu+\frac{1}{2}n^2\nu)$, but as far as numerical approximation goes, this probably has terrible convergence. – StijnDeVuyst Nov 05 '19 at 15:14
  • Another thing that maybe helps is that for large $x$, we have $\ln(1+e^x)\approx x$, so the integral is going to be close to $\mu\sqrt{2\pi\nu}$ if $\mu/\sqrt{\nu}$ is positive and sufficiently large, say at least 5 or so. – StijnDeVuyst Nov 05 '19 at 15:18
  • @StijnDeVuyst I obtain a closed analytical expression in terms of the error function if I approximate $\ln(1+e^x)\approx \max(0,x)$. But I was hoping for something more accurate than this. – becko Nov 05 '19 at 15:22
  • If $v$ is small a [Laplace approximation](https://en.wikipedia.org/wiki/Laplace%27s_method) may work well. – Jarle Tufto Nov 05 '19 at 15:23
  • You could split the integration into three parts, say $x < -a$, $-a < x < b$ and $x > b$, so that $\log(1+e^x)$ is close to $0$ and to $x$ on the first and third parts. The second part can then be handled by numerical quadrature. – Yves Nov 05 '19 at 15:36
  • In case anyone is interested, we have a similar question (not a duplicate) which examines the ReLU function in place of the softplus function. https://stats.stackexchange.com/questions/392226/what-would-be-the-output-distribution-of-relu-activation – Sycorax Nov 05 '19 at 16:13
  • Based on the remark of @Yves: you could approximate $\ln(1+e^x)$ by a piece-wise linear function with some strategically chosen discontinuity points in the neighborhood of $x=0$ where the function has a lot of curvature. The nice thing is that for each linear piece, the integral can be evaluated in closed form if you don't mind a lot of evaluations of $\Phi(\cdot)$. You can make this as accurate as you want by introducing more discontinuity points. – StijnDeVuyst Nov 05 '19 at 17:09
  • Another nice thing is that the error can be easily controlled. On $(-\infty,\, -a)$ we have $\log(1 + e^x) \leq e^x \leq e^{-a}$. On $(b,\, \infty)$ we have $0 \leq \log(1 + e^x) - x \leq e^{-x} \leq e^{-b}$. So taking $a = b = 10$ gives a nice precision for the integrals on the unbounded intervals whatever be $\mu$ and $\nu$. (A small R sketch of this split follows these comments.) – Yves Nov 05 '19 at 17:17
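To illustrate the split concretely, here is a rough R sketch (my own illustration, not from the thread; $a = b = 10$, $\mu = 0$ and $\nu = 10$ are arbitrary example values, and the left tail is simply dropped since its contribution is bounded by the $e^{-a}$ argument above):

mu <- 0; v <- 10; a <- 10; b <- 10
# middle part by numerical quadrature (log1p reduces rounding error)
mid <- integrate(function(x) exp(-(x - mu)^2/(2*v)) * log1p(exp(x)), -a, b)$value
# right tail: log(1 + e^x) ~ x there, and the integral of x against the
# Gaussian kernel has a closed form via the normal cdf/pdf
s <- sqrt(v); beta <- (b - mu)/s
right <- sqrt(2*pi*v) * (mu * pnorm(beta, lower.tail = FALSE) + s * dnorm(beta))
mid + right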

4 Answers

10

Gauss-Hermite (GH) quadrature is a very efficient method compared to the Monte Carlo (MC) method. If you care about the trade-off between precision and speed (here measured by the number of function evaluations), GH is far superior to MC sampling for the specific integral under consideration.

The software I use to solve the integral by the Gauss-Hermite procedure approximates the integral

$$(1) \ \ \int_{-\infty}^\infty \exp\left(-z^2\right)g(z) dz$$

as a weighted sum

$$\approx \sum_{i=1}^N w_i \, g(z_i)$$

The procedure is to select the number of points $N$, here referred to as the sample size. The software then returns the weights $w_1,...,w_N$ and the nodes $z_1,...,z_N$. The nodes are plugged into the function $g(\cdot)$ to get $g_i = g(z_i)$, and the user then computes $$\sum_i w_i g_i.$$
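As a quick sanity check of this rule (a sketch I added, using the statmod package that also appears in the code below): with $g(z) = 1$ the weighted sum should reproduce $\int_{-\infty}^\infty e^{-z^2}\,dz = \sqrt{\pi}$.

library(statmod)
gh <- gauss.quad(10, kind="hermite")  # 10 nodes and weights for the exp(-z^2) kernel
sum(gh$weights)                       # should be close to sqrt(pi) = 1.7724...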

To apply the procedure, the first step is to relate the integral $\int_{-\infty}^\infty \exp\left(-z^2\right)g(z) dz$ to the integral under consideration, up to a known constant of proportionality. This can be done by letting $$z = (x-\mu)/\sqrt{2v}$$ and by letting $$g(z) = \log(1+\exp(\mu +\sqrt{2v}\,z)).$$ By substitution it follows that

$$(2) \ \ \int_{-\infty}^\infty \exp\left(-z^2\right)g(z)\, dz = \frac{1}{\sqrt{2v}}\int_{-\infty}^\infty \exp\left(- \frac{(x-\mu)^2}{2v} \right)\log(1+\exp(x))\,dx$$

Since the R code approximates the integral on the left-hand side, the result has to be multiplied by the constant $\sqrt{2v}$ to get the result the OP is looking for. The integral is then solved in R by the following code:

library(statmod)
mu <- 0; v <- 10                                   # example parameter values
g <- function(z) log(1 + exp(mu + sqrt(2*v)*z))
hermite <- gauss.quad(50, kind="hermite")          # 50 nodes and weights
sum(hermite$weights * g(hermite$nodes))*sqrt(2*v)

In this respect the method is no different from the MC method, because the MC method also presupposes that the integral under consideration can be written in a certain form, more precisely

$$\int_{-\infty}^{\infty}h(x)f(x) dx$$

where $f(x)$ is a density it is possible to sample from. In the current case this is particularly easy, because the OP's problem can be written as

$$\sqrt{2\pi v}\int_{-\infty}^\infty\frac{1}{\sqrt{2\pi v}} \exp\left(- \frac{(x-\mu)^2}{2v} \right)\log(1+\exp(x))dx$$

The OP's problem is then given as $$\sqrt{2\pi v} \ \mathbb E[\log(1+\exp(x))] \approx \sqrt{2\pi v} \, \frac{1}{N} \sum_i \log(1+\exp(x_i))$$ for a sample $\{x_i\}_{i=1}^N$, where $x_i$ is a random normal draw with mean $\mu$ and variance $v$. This can be done using the following R code:

f <- function(x) log(1 + exp(x))
w <- rnorm(1000, mean=mu, sd=sqrt(v))   # mu and v as defined above
sqrt(2*pi*v)*mean(f(w))

Rough Comparison

To make a rough comparison, first note that both the MC method and the GH method make an approximation of the form

$$\sum_{i=1}^N w_i \cdot F(s_i)$$

It therefore seems reasonable to compare the methods on the precision they achieve for a given sample size $N$. To do this I use the R function integrate() to find the "true" value of the integral, and then calculate the deviation from this assumed true value at different sample sizes $N$ for both the MC method and the GH method. The result is displayed in Picture 1, which clearly shows how the GH method gets very close even for a small number of function evaluations (the red line quickly goes to 0), whereas the MC error goes to zero only very slowly.

The plot is generated using the following code

library(statmod)
mu <- 0
v <- 10

f <- function(x) log(1+exp(x))
g <- function(z) log(1+exp(mu + sqrt(2*v)*z))
h <- function(x) exp(-(x-mu)^2/(2*v))*log(1+exp(x))

# Monte Carlo integration using function f
w <- rnorm(1000,mean=mu,sd=sqrt(v))
sqrt(2*pi*v)*mean(f(w)) 

# Integration using build in R function
int.sol <- integrate(h,-500,500)
int.sol$value

# Integration using Gauss-Hermite
hermite <- gauss.quad(50,kind="hermite")
sum(hermite$weights * g(hermite$nodes))*sqrt(2*v)

# Now solve for different sample sizes
# Set up the sample sizes
sample_size <- 1:300

error <- matrix(0,ncol=length(sample_size),nrow=2)
for (i in 1:length(sample_size))
    {
        MC <- rep(0,1000)
        for (j in 1:1000)
        {
            w <- rnorm(sample_size[i],mean=mu,sd=sqrt(v))
            MC[j] <- sqrt(2*pi*v)*mean(f(w)) 
        }
        error[1,i] <- mean(abs(MC - int.sol$value))

        hermite <- gauss.quad(sample_size[i],kind="hermite")
        her.sol <- sum(hermite$weights * g(hermite$nodes))*sqrt(2*v)
        error[2,i] <- (abs(her.sol - int.sol$value))
    }
plot(sample_size,error[1,],ylim=c(0,10),type="l",ylab="error")
points(sample_size,error[2,],type="l",lwd=2,col="red")
Jesper for President
  • Just tried out your suggestion from the comments to my answer, but the result is not even close to the correct answer: `quad.par – cdalitz Nov 07 '19 at 12:03
  • You probably did not do the necessary change of variables; I get the right results. – Jesper for President Nov 07 '19 at 14:45
  • @jseper-hybel "necessary change of variables": I give up and acknowledge that the MC approach is much more viable ;-) – cdalitz Nov 07 '19 at 15:03
  • The kernel is explicitly written as $\exp(-z^2)$ in the documentation of the R code, so you have to set $z = (x-\mu)/\sqrt{2v}$, just like you have to realize that the kernel in the stated problem is the normal density only up to the constant $1/\sqrt{2\pi v}$, which is why you have to multiply your MC average by the constant $\sqrt{2\pi v}$, as you have done in your code. But I will post my solution. – Jesper for President Nov 07 '19 at 15:10
  • Thank you very much for working out your solution! Is there also a way, like for the MC solution and R's *integrate* function, to give an estimate of the error? – cdalitz Nov 08 '19 at 09:46
  • Yes, I am sure it is possible, but I do not know the details. After all, integrate() is just another application of quadrature, so you can look in its source code to see how the error estimate is generated. A dirty but easy technique is simply to perform the integration using a lot of nodes and then calculate the error for trials with fewer nodes (a small sketch of this follows these comments). My point here is simply that you do not want to do MC with N=1000000 when you get the same precision using N=50 with Gauss-Hermite (or Gauss-Kronrod as you used). – Jesper for President Nov 08 '19 at 10:11
  • @StopClosingQuestionsFast I read with great interest your answer and `R` code for GH integration. I am desperate to find a clear example of multivariate integration with GH quadrature in `R`. Would it be possible to provide a complete example with a 2-dimensional integration? This would be so helpful for me! – Flora Grappelli Jan 22 '20 at 17:12
  • I have not had the chance to dig into that problem. https://www-r--bloggers-com.cdn.ampproject.org/v/s/www.r-bloggers.com/notes-on-multivariate-gaussian-quadrature-with-r-code/amp/?amp_js_v=a2&amp_gsa=1&usqp=mq331AQCKAE%3D#aoh=15797275043398&referrer=https%3A%2F%2Fwww.google.com&amp_tf=Fra%20%251%24s&ampshare=https%3A%2F%2Fwww.r-bloggers.com%2Fnotes-on-multivariate-gaussian-quadrature-with-r-code%2F perhaps this can help. – Jesper for President Jan 22 '20 at 21:27
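Following up on the node-comparison idea in the last comment, a minimal sketch (my own illustration; gh.val is a made-up helper name):

library(statmod)
mu <- 0; v <- 10                          # example values
g <- function(z) log1p(exp(mu + sqrt(2*v)*z))
gh.val <- function(n) {
    q <- gauss.quad(n, kind="hermite")
    sum(q$weights * g(q$nodes)) * sqrt(2*v)
}
abs(gh.val(20) - gh.val(200))             # rough error estimate for the 20-node rule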
6

As you mentioned, this is just the average value of $\ln(1+e^x)$ (up to the constant $\sqrt{2\pi\nu}$) when $x$ is normally distributed with mean $\mu$ and variance $\nu$. So all you have to do is:

1) Draw a large number $N$ of samples $X_i \sim N(\mu, \nu)$

2) Your estimate $\hat\theta$ is then:

$$\hat\theta = \frac{\sqrt{2\pi\nu}}{N} \sum^N_{i=1}\ln(1+e^{X_i})$$
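A minimal R sketch of this estimator (my own illustration; mu, v and N are example values, and log1p(exp(x)) can still overflow for very large draws, see cdalitz's answer below):

mu <- 0; v <- 10; N <- 1e5
x <- rnorm(N, mean=mu, sd=sqrt(v))                 # step 1: draw X_i ~ N(mu, v)
theta.hat <- sqrt(2*pi*v) * mean(log1p(exp(x)))    # step 2: form the estimate
theta.hat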

Kane Chua
  • Oh yes I thought of this. But I was aiming for a deterministic numerical evaluation, if possible. – becko Nov 05 '19 at 15:21
5

Out of curiosity, I have tried the suggestions of both answers with the following R code (note that the integrand $\log(1+e^x)$ must be approximated by $x$ for large values to avoid overflow):

f <- function(x, mu, s) {
    # approximate integrand for large values
    log.part <- ifelse(x<100, log(1+exp(x)), x)
    return(exp(-(x-mu)^2/(2*s*s)) * log.part)
}

m <- 1; s <- 1

# Monte Carlo integration
N <- 10^6
x <- rnorm(N, mean=m, sd=s)
int.mc <- sqrt(2*pi)*s * mean(log(1+exp(x)))
int.mc.sd <- sqrt(2*pi)*s * sd(log(1+exp(x))) / sqrt(N)
cat(sprintf("Monte Carlo: %f +/- %f\n", int.mc, 2*int.mc.sd))

# numerical integration
int.num <- integrate(f, lower=-Inf, upper=Inf, mu=m, s=s)
cat(sprintf("Numerical:   %f +/- %f\n", int.num$value, int.num$abs.error))

It turns out that, for this integral, the results are close to each other, but the numerical integration (R uses a Gauss-Kronrod quadrature in connection with extrapolation by Wynn's Epsilon algorithm) is both faster and more accurate (the Monte Carlo "accuracy" is a 95% confidence interval):

Monte Carlo: 3.525610 +/- 0.003548
Numerical:   3.526466 +/- 0.000001
cdalitz
  • Thanks. I don't use `R`, but I bet there is a function `log1pexp(x)` you can use in place of `log(1+exp(x))` to avoid the intermediate overflow you mention? – becko Nov 06 '19 at 12:49
  • See https://www.rdocumentation.org/packages/qgam/versions/1.3.0/topics/log1pexp. – becko Nov 06 '19 at 12:50
  • @becko: just tried this function, and it does not have any effect on the result. The simple asymptotic formula is thus a good choice for $x>100$. – cdalitz Nov 06 '19 at 13:00
  • @JesperHybel Well this answer was posted 20 minutes ago, while the MC answer was yesterday. So give it more time. – becko Nov 06 '19 at 13:05
  • @jesper-hybel A possible reason why the MC answer is preferred by many readers might be that it is easy to implement and even includes the code showing how to do it. Actually I could only try out the numerical integration by relying on a third-party library, while I could implement the MC integration myself. If you add coding time to the performance time, MC is often significantly faster ;-) – cdalitz Nov 06 '19 at 13:54
  • A better approximation for large $x$ is $x+e^{-x}$; Maple's asympt(log(1+exp(x)), x) gives $x + e^{-x} - \frac{1}{2}e^{-2x} + \frac{1}{3}e^{-3x} - \frac{1}{4}e^{-4x} + \frac{1}{5}e^{-5x} + O\left(e^{-6x}\right)$ (a quick numerical check follows below). – kjetil b halvorsen Nov 25 '19 at 11:26
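For what it is worth, a quick numerical check of the two asymptotes (my own sketch):

x <- 15
log1p(exp(x)) - x                # plain asymptote: error about 3e-07 (roughly e^{-15})
log1p(exp(x)) - (x + exp(-x))    # refined asymptote: error near machine precision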
5

The following diagram compares the desired expectation $E[\ln \left(1+e^x\right)]$ to the 45-degree asymptote $y = \mu$ (when the variance is 1), and to the $x$-axis asymptote when $\mu$ is negative.

[figure: $E[\ln(1+e^x)]$ as a function of $\mu$, with the asymptotes $y = 0$ and $y = \mu$]

That suggests a very simple approximation when $\mu$ is either relatively large and negative (use $0$) or relatively large and positive (use $\mu$). More generally, increasing the variance pushes outwards the range of 'contact / excellent approximation'. If you play around with it a bit, I suspect it will be possible to come up with some fairly simple rule like $|\frac{\mu}{2\sqrt{\sigma}}| > 3$ or similar for when the asymptote approximation works nicely.
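A small sketch of such a comparison (my own illustration; the cutoff x < 30 guards against overflow, as in cdalitz's answer above, and the variance $v = 1$ matches the figure):

v <- 1
E.softplus <- function(mu) {
    h <- function(x) exp(-(x - mu)^2/(2*v)) * ifelse(x < 30, log1p(exp(x)), x)
    integrate(h, mu - 10*sqrt(v), mu + 10*sqrt(v))$value / sqrt(2*pi*v)
}
# exact expectation versus the asymptote max(0, mu) on a small grid
sapply(c(-6, -3, 0, 3, 6), function(mu) c(exact = E.softplus(mu), asymptote = max(0, mu)))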


The following implements becko's suggestion (see comment below) of using $E[\max(0,X)]$ as the approximation (see the dashed line), for which closed-form solutions exist. This is equivalent to the value of an option in a world where the parent random variable is normal (as opposed to lognormal).

[figure: $E[\ln(1+e^x)]$ together with the dashed closed-form $E[\max(0,X)]$ approximation]

It works remarkably well when $\sigma$ is large.
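For reference, a sketch of that closed-form approximation (my own transcription; softplus.approx is a made-up name, and I use the standard fact that $E[\max(0,X)] = \mu\Phi(\mu/\sigma) + \sigma\phi(\mu/\sigma)$ for $X \sim N(\mu, \sigma^2)$):

softplus.approx <- function(mu, v) {
    s <- sqrt(v)
    e.max <- mu * pnorm(mu/s) + s * dnorm(mu/s)   # E[max(0, X)]
    sqrt(2*pi*v) * e.max                          # the OP's unnormalized integral
}
softplus.approx(1, 1)   # about 2.72, versus about 3.53 from integrate() above;
                        # the gap is largest when much probability mass sits near x = 0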

wolfies
  • I obtained a closer approximation by replacing $\ln(1+e^x)\approx\max(0,x)$, for which the average can be computed analytically using error functions. But as you said this is not very accurate when $\mu$ is close to zero. – becko Nov 06 '19 at 14:30
  • That is very nice indeed - I was thinking the curve just looked like a call option! – wolfies Nov 06 '19 at 15:06
  • I like the graphs, insight, and intuition from this answer. It shows very clearly where approximations work well and where they don't. I wonder whether the curve $E[\ln(1+e^x)]$ can't be approximated with some simple function. It looks very much like $E[\ln(1+e^x)] \approx \ln(1+e^\mu)$. – Sextus Empiricus Nov 07 '19 at 21:51
  • @SextusEmpiricus I noticed you are based in Sion which I recognise because I sometimes attend the Verbier Festival, if I am in Europa. – wolfies Jan 18 '20 at 16:06
  • @wolfies I can give a numbers tour in Sion, whenever you are around. – Sextus Empiricus Jan 22 '20 at 12:33