We can see from this answer that the smallest positive float in Python (just taking Python as an example) is 5e-324, a consequence of IEEE 754, and since this is a hardware limit it applies to other languages as well.
In [2]: np.nextafter(0, 1)
Out[2]: 5e-324
And any positive float smaller than that underflows to 0.
In [3]: np.nextafter(0, 1)/2
Out[3]: 0.0
Now let's look at the Naive Bayes formula with discrete features and two classes,
as you required:
$$
p(S=1|w_1, ..., w_n) = \frac{p(S=1) \prod_{i=1}^n p(w_i|S=1)}{\sum_{s\in\{0, 1\}}p(S=s)\prod_{i=1}^n p(w_i|S=s)}
$$
Let me instantiate that formula with a simple NLP task below.
Suppose we want to detect whether an incoming email is spam ($S=1$) or not spam ($S=0$), our vocabulary has 5,000 words ($n = 5000$), and for simplicity the only thing we care about is whether a word $w_i$ occurs in the email ($p(w_i|S=1)$) or not ($1-p(w_i|S=1)$), i.e. Bernoulli naive Bayes.
In [1]: import numpy as np
In [2]: from sklearn.naive_bayes import BernoulliNB
# let's train our model with 200 samples
In [3]: X = np.random.randint(2, size=(200, 5000))
In [4]: y = np.random.randint(2, size=(200, 1)).ravel()
In [5]: clf = BernoulliNB()
In [6]: model = clf.fit(X, y)
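Before going on, note what the fitted model exposes (standard sklearn attributes, already stored in log space; the shapes below are what we expect for 2 classes and 5,000 features):
model.feature_log_prob_.shape   # (2, 5000): log p(w_i=1 | S=s) for each class s
model.class_log_prior_.shape    # (2,): log p(S=0) and log p(S=1)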
We can see that $p(S=s)\prod_{i=1}^n p(w_i|S=s)$ becomes extremely small: every factor in $\prod_{i=1}^{5000}$ (both $p(w_i|S=1)$ and $1-p(w_i|S=1)$) lies between 0 and 1, so the product quickly drops below 5e-324, both the numerator and the denominator underflow to 0, and we end up computing $0/0$.
In [7]: (np.nextafter(0, 1)*2) / (np.nextafter(0, 1)*2)
Out[7]: 1.0
In [8]: (np.nextafter(0, 1)/2) / (np.nextafter(0, 1)/2)
/home/lerner/anaconda3/bin/ipython3:1: RuntimeWarning: invalid value encountered in double_scalars
#!/home/lerner/anaconda3/bin/python
Out[8]: nan
In [9]: l_cpt = model.feature_log_prob_
In [10]: x = np.random.randint(2, size=(1, 5000))
In [11]: cls_lp = model.class_log_prior_
In [12]: probs = np.where(x, np.exp(l_cpt[1]), 1-np.exp(l_cpt[1]))
In [13]: np.exp(cls_lp[1]) * np.prod(probs)
Out[13]: 0.0
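To see how far below 5e-324 that product really is, we can measure it in log space (a quick check of my own, reusing l_cpt, cls_lp and x from above); this also hints at the fix used later:
log_probs = np.where(x, l_cpt[1], np.log(1 - np.exp(l_cpt[1])))  # log p(w_i|S=1) or log(1-p(w_i|S=1))
cls_lp[1] + log_probs.sum()   # finite, roughly 5000*log(0.5), about -3466 for this random data
np.log(np.nextafter(0, 1))    # about -744.4, the log of the smallest representable float
So the true joint probability is around $e^{-3466}$, far smaller than 5e-324, which is why the direct product above gives 0.0.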
Then the problem arises: how can we calculate the probability that the email is spam, $p(S=1|w_1, ... w_n)$? Or, concretely, how can we calculate the numerator and the denominator?
We can see the official implementation in sklearn:
jll = self._joint_log_likelihood(X)
# normalize by P(x) = P(f_1, ..., f_n)
log_prob_x = logsumexp(jll, axis=1)
return jll - np.atleast_2d(log_prob_x).T
For the numerator it converts the product of probabilities into a sum of log likelihoods, and for the denominator it uses logsumexp from scipy, which is:
out = log(sum(exp(a - a_max), axis=0))
out += a_max
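As a small illustration of my own (the numbers are made up just for the demo), the naive way of summing the probabilities underflows, while the shifted version stays finite:
from scipy.special import logsumexp
a = np.array([-1000.0, -1001.0])  # two joint log likelihoods, both far below log(5e-324), about -744
np.log(np.exp(a).sum())           # -inf, because exp(-1000) underflows to 0
logsumexp(a)                      # about -999.69, never leaves the representable range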
The point is that we cannot add two joint probabilities by adding their joint log likelihoods, so we have to leave log space and go back to probability space. But the true probabilities are too small to add directly, so we first scale them and do the addition, $\sum_{s\in\{0,1\}} e^{jll_s - max\_jll}$, then take the result back into log space, $\log\sum_{s\in\{0,1\}} e^{jll_s - max\_jll}$, and finally rescale it in log space by adding back $max\_jll$: $max\_jll + \log\sum_{s\in\{0,1\}} e^{jll_s - max\_jll}$.
And here is the derivation:
$\begin{align}
\log \sum_{s\in\{0,1\}} e^{jll_s} & =
\log \sum_{s\in\{0,1\}} e^{jll_s}e^{max\_jll - max\_jll} \\& =
\log \left(e^{max\_jll}\sum_{s\in\{0,1\}} e^{jll_s - max\_jll}\right) \\& =
\log e^{max\_jll} + \log\sum_{s\in\{0,1\}} e^{jll_s - max\_jll} \\& =
max\_jll + \log\sum_{s\in\{0,1\}} e^{jll_s - max\_jll}
\end{align}$
where $max\_jll$ is the $a\_max$ in the code.
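Written out in NumPy, the derivation is just a few lines; this is a from-scratch sketch of the same trick (the name my_logsumexp is mine, it is not sklearn's or scipy's code):
def my_logsumexp(jll):
    # jll: array of joint log likelihoods, one entry per class
    max_jll = jll.max()                               # the a_max in the scipy code above
    return max_jll + np.log(np.exp(jll - max_jll).sum())
my_logsumexp(np.array([-1000.0, -1001.0]))            # matches scipy's logsumexp on the same input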
Once we have both the numerator and the denominator in log space, we can get the log conditional probability ($\log p(S=1|w_1, ... w_n)$) by subtracting the denominator from the numerator:
return jll - np.atleast_2d(log_prob_x).T
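Putting everything together, here is a rough end-to-end sketch of my own (a re-derivation, not sklearn's exact code) that reproduces the prediction for the email x entirely in log space:
from scipy.special import logsumexp
log_p = model.feature_log_prob_                  # log p(w_i=1 | S=s)
log_1mp = np.log(1 - np.exp(log_p))              # log p(w_i=0 | S=s)
# numerator for each class s: log p(S=s) + sum_i log p(w_i | S=s)
jll = model.class_log_prior_ + np.where(x, log_p, log_1mp).sum(axis=1)
# denominator: log p(w_1, ..., w_n) via logsumexp
log_prob_x = logsumexp(jll)
np.exp(jll - log_prob_x)                         # should match model.predict_proba(x)[0]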
Hope that helps.
References:
1. Bernoulli Naive Bayes Classifier
2. Spam Filtering with Naive Bayes – Which Naive Bayes?