4

I currently have a likelihood function defined as follows:

$$ L=\prod_{i=1}^{N}\left[\phi\prod_{s=1}^{S_i}L_{is}(y\mid \rho_A) + (1-\phi)\prod_{s=1}^{S_i}L_{is}(y\mid \rho_B)\right] $$

Here, $y$ is an observation, $\rho_A$ and $\rho_B$ are vectors of parameters, and $\phi$ is a scalar mixing weight.

What makes this likelihood different from the usual likelihoods I have seen is that not only are there $N$ observations, but each observation $i$ has sub-values indexed by $s=1,\ldots,S_i$.

My problem right now is that the $L_{is}(y\mid \rho_A)$ are small to begin with, with some values around $0.000000001$ or even $0$. Therefore, if I multiply many of these small values, I get an even smaller result that R or most other languages will just treat as zero. However, taking the logarithm of $L$ only converts the outer product into a sum; the logarithm cannot pass through the sum inside the brackets, so the inner products still underflow.
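To illustrate with made-up values of this magnitude: in R,

prod(rep(1e-9, 40))

returns exactly 0, since the true value $10^{-360}$ lies below the smallest positive double (about $5 \times 10^{-324}$).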

I've also tried taking the exponential, and then the log, of the product, but I still get values that go straight to zero. Would anyone know of a prudent transformation to use? I have tried the log-transform method where I find the maximum and then re-scale, as in: Converting (normalizing) very small likelihood values to probability. However, it appears that this approach would not work for likelihoods with denominators of zero. Thanks!

user123276
  • Look up "log sum exp" and underflow – jaradniemi Mar 18 '15 at 12:25
  • Please search our site for [underflow](http://stats.stackexchange.com/search?tab=votes&q=underflow), where you will find many useful recommendations for dealing with absurdly large or small numbers in statistical computations. – whuber Mar 18 '15 at 14:40

2 Answers

9

I recently had to deal with the same issue when computing conditional probabilities involving numbers on the order of $10^{-10000}$ (because normalizing the probability distribution would have required a great deal of unnecessary calculations). The heart of this difficulty, which comes up repeatedly in statistical computation, concerns taking logarithms of sums which themselves can overflow or underflow the computing platform's form of numeric representation (typically IEEE doubles). In the present case (to abstract away all irrelevant detail) we may write

$$L = \prod_i \left(\sum_j x_{ij}\right)$$

where the products $x_{i1} = \phi\prod_{s=1}^{S_i}L_{is}(y\mid \rho_A)$ and $x_{i2} = (1-\phi)\prod_{s=1}^{S_i}L_{is}(y\mid \rho_B)$ are themselves best computed using logarithms. Naturally we would like to compute $L$ as
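For concreteness, here is a minimal R sketch of that first step, with invented numbers and a single $i$ held fixed:

logLA <- log(c(1e-9, 3e-8, 2e-10))   # log L_is(y | rho_A), s = 1, ..., S_i
logLB <- log(c(5e-9, 1e-7, 8e-10))   # log L_is(y | rho_B)
phi <- 0.3
log.x1 <- sum(logLA) + log(phi)      # log(x_{i1}) without forming the tiny product
log.x2 <- sum(logLB) + log(1 - phi)  # log(x_{i2})

The underflowing products themselves never appear; only their logarithms do.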

$$L = \exp(\log(L));\quad \log(L) = \sum_i \log \left(\sum_j x_{ij}\right)$$

and in most applications, having $\log(L)$ is all that is really needed. There is the rub: how to compute logarithms of sums? Dropping the (now superfluous) subscript $i$, and assuming the $x_j$ will be computed using logarithms, we are asking how to calculate expressions like

$$\log\sum_j \exp(\log(x_j))$$

without over- or underflow.

(Note, too, that in many statistical applications it suffices to obtain a natural logarithm to an accuracy of only one or two decimal places, so high precision is not usually needed.)

The solution is to keep track of the magnitudes of the arguments. When $x_j \gg x_k$, then (to within a relative error of $x_k/x_j$) it is the case that $x_j + x_k = x_j$.

Often this magnitude-tracking can be done by estimating the order of magnitude of the sum (in some preliminary approximate calculation, for instance) and simply ignoring any term which is much smaller in magnitude. For more generality and flexibility, we can examine the relative orders of magnitude as we go along. This works best when summands have the most similar orders of magnitude, so that as little precision as possible is lost in each partial sum. Provided, then, that we arrange to sum the terms from smallest to largest, we may repeatedly apply this useful approximation and, in the end, lose no meaningful precision.
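A quick R check of this point (the cutoff matches the `precision = log(10^17)` default, about $39.1$, in the code below):

log(1 + exp(-40))               # exactly 0: 1 + exp(-40) rounds to 1 in doubles
exp(-40) < .Machine$double.eps  # TRUE, so the smaller term is invisible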

In many statistical applications (such as computing log likelihoods, which frequently involve Gamma and Beta functions) it is handy to have built-in functions to return the logarithms of factorials $n!$ and binomial coefficients $\binom{n}{k}$. Most statistical and numerical computing platforms have such functions--even Excel does (GAMMALN). Use these wherever possible.
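In R, for example, these built-ins are `lgamma` and `lchoose`:

lgamma(1000 + 1)    # log(1000!), about 5912.128, even though 1000! overflows
lchoose(1000, 500)  # log of choose(1000, 500), about 689.467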


Here is R code to illustrate how this could be implemented. So that it will serve as (executable) pseudocode, only simple Fortran-like expressions are used: they will port easily to almost any computing environment.

log.sum <- function(y, precision=log(10^17)) {
  # 
  # Returns the logarithm of sum(exp(y)) without overflow or underflow.
  # Achieves a relative error approximately e^(-precision).
  #
  # Some precision may be lost in computing `log(1+exp(*))`, depending on
  # the computing platform.
  #
  if (length(y) == 0) return(-Inf)  # an empty sum is 0, and log(0) = -Inf
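  #
  # `log.plus` computes log(exp(a) + exp(b)); when a and b differ by more
  # than `precision`, the smaller term cannot affect the double-precision sum.
  #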
  log.plus <- function(a, b) ifelse(abs(b-a) > precision, 
                                    max(b,a), max(b,a) + log(1 + exp(-abs(b-a))))
  y <- sort(y)
  x <- y[1]
  for (z in y[-1]) x <- log.plus(x, z)
  return (x)  
}

For example, let's compute $\log\sum_{i=0}^{1000}\exp(i) = 1000.45867514538708\ldots$ in R with and without this method:

x <- seq(0, 1000, by=1); print(log(sum(exp(x))), digits=18)

[1] Inf

Summation simply overflows. However,

print(log.sum(x), digits=18)

[1] 1000.45867514538713

agrees with the true value to nearly 17 significant digits, which is all one can expect of IEEE doubles.

whuber
  • I think this problem is often approached via the "log-sum-exp trick" ([see e.g. here](https://hips.seas.harvard.edu/blog/2013/01/09/computing-log-sum-exp/)), which at first sight looks different from your trick. The log-sum-exp trick is simply to compute $\log \sum \exp(x_i)$ as $A+\log \sum \exp(x_i-A)$, where one good choice for $A$ is $A=\max_i x_i$. This will prevent overflowing, and if some terms underflow then they are irrelevant anyway. – amoeba Mar 18 '15 at 16:12
  • @Amoeba Thank you for sharing that. It is indeed a little different in that it applies an additive offset to $x_i$ rather than to $\log(x_i)$ (which is tantamount to a multiplicative offset to $x_i$). I also use it frequently. – whuber Mar 18 '15 at 16:38
  • Another difference seems to be that your solution is iterative, whereas log-sum-exp trick offsets all terms together... I don't know if it matters in practice. +1, btw. – amoeba Mar 18 '15 at 16:44
  • @amoeba It does matter a little in practice, because the iterative approach can be a bit slower. I offered it as an example because it is a relatively mindless, general-purpose procedure which illustrates the points that were previously made (but doesn't really add anything to them). The key idea is apparent both in your log-sum-exp trick and in the example here: use algebraic manipulations to make sure you never provide a bad argument to the exponential function. In so doing, you probably want to work with the logs as much as possible, because their antilogs cause over/underflow. – whuber Mar 18 '15 at 16:52
  • @whuber thanks so much for the post, really really helps! I had a follow-up question to what you wrote, and it's regarding what I should do with the $x_{i1}$ and $x_{i2}$ product calculation. I am not able to pass the log through, so I was wondering what you meant by "the products...are themselves best computed using logarithms". Does this mean just taking the logarithm? Would that result in problems, as I'm not passing the log from the outside? Many thanks! – user123276 Mar 20 '15 at 06:31
  • Is there any reference for this recursive summation? For the shifted version I have found [this](https://academic.oup.com/imajna/advance-article/doi/10.1093/imanum/draa038/5893596). – hakanc Mar 04 '21 at 09:30
  • @hakanc Could you explain what you are referring to by "recursive summation"? – whuber Mar 04 '21 at 14:25
  • I meant the for loop over $y$, which I see now is not a recursion but a reduction. – hakanc Mar 04 '21 at 15:17
9

This is just a small addition to @whuber's answer (+1).

To compute $$\log \sum \exp (x_i)$$ without overflowing or underflowing, a "log-sum-exp trick" is often used, see e.g. here for a description. The trick is to compute $$\max(x_i) + \log \sum \exp (x_i - \max(x_i))$$ instead. This will prevent overflowing, and if some terms underflow then they are irrelevant anyway.

Using @whuber's example of $x_i=1\ldots 1000$, here is a demonstration in Matlab. Direct computation

x = 1:1000;
log(sum(exp(x)))

yields Inf, but

a = max(x) + log(sum(exp(x - max(x))));
display(num2str(a, 20))

yields 1000.4586751453871329, correct to 17 places, as in @whuber's answer.
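The same two lines, translated to R (the language used in the question):

x <- 1:1000
a <- max(x) + log(sum(exp(x - max(x))))
print(a, digits=18)  # about 1000.458675145387, as above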

amoeba
  • hi, thanks for your response, for the calculation of these values, it seems we are doing it for each $i$. Why isn't it possible to find the maximum across all $i$ instead of just $j$? thanks! – user123276 Mar 20 '15 at 02:10
  • There is no $j$ in my answer... It is just a sum over $i$, and the maximum is taken over all $i$. Does it make sense? – amoeba Mar 20 '15 at 08:57
  • Thanks for the response, what I'm a bit confused about is how to apply it to my problem, since the notations are a bit different. If I wanted to use your case for my problem, would I just use $x_i = log(\prod_{i}L_{is})$? I guess what is confusing me is how my problem can be turned into a log-sum-exp problem, thanks! – user123276 Mar 20 '15 at 09:20
  • i tried it out and it worked! thank you thank you! much appreciated! – user123276 Mar 20 '15 at 11:47
  • No, wait, what I wrote in the comment above is wrong! Look, you want to compute $\log L = \log \prod_i (\prod_s A_{is} + \prod_s B_{is}) = \sum_i \log (\prod_s A_{is} + \prod_s B_{is}).$ The whole summation over $i$ is not a problem at all, the problem is the inner term. You can compute $a = \log A$ and $b=\log B$ and the problem is to compute $\log(A+B)$, right? Well, this is still a log-sum-exp, because $\log(A+B)=\log(\exp \log A + \exp \log B)=\log(\exp a + \exp b).$ So it's log-sum-exp situation but with only two terms in the sum! This is $j=1,2$ in @whuber's answer. – amoeba Mar 20 '15 at 12:05
  • Thanks for the response! I am just having one little issue. It is that sometimes my data yields $A_{is}=0$ or $B_{is}=0$, so that $logA$ and $logB$ go to `-Inf`. Would you know what to do in this case? Would setting the problematic $A_{is}$ or $B_{is}$ to a ridiculously small number like `1e-1000` be good enough? Thank you thank you!!!! – user123276 Mar 20 '15 at 12:33
  • But if one $A_{is}=0$, it would mean that the whole product $\prod_s A_{is}=0$. Isn't it a problem? Half of your expression for $L$ vanishes. If at least one $A_{is}=0$ and at least one $B_{is}=0$, then you don't need to compute anything, because $L=0$. – amoeba Mar 20 '15 at 12:36
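Pulling these comments together, here is a minimal R sketch of the two-term case (the names `log.add` and `log.term` are invented for illustration). Note that it tolerates `-Inf` arguments, which is exactly what a zero $\prod_s A_{is}$ or $\prod_s B_{is}$ produces:

# log(A + B) from a = log(A) and b = log(B), robust to -Inf arguments
log.add <- function(a, b) {
  m <- max(a, b)
  if (m == -Inf) return(-Inf)  # both components are zero
  m + log(exp(a - m) + exp(b - m))
}

# Contribution of observation i to log L, given the vectors of log-values
log.term <- function(logA, logB, phi) {
  log.add(sum(logA) + log(phi), sum(logB) + log(1 - phi))
}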