We can see from this answer that the smallest positive float in Python (just taking Python as an example) is 5e-324, a consequence of IEEE 754, and since this is a hardware limit it applies to other languages as well.
In [2]: np.nextafter(0, 1)
Out[2]: 5e-324
And any positive float smaller than that underflows to 0.
In [3]: np.nextafter(0, 1)/2
Out[3]: 0.0
Now let's look at the Naive Bayes formula with discrete features and two classes,
as you required:
$$
p(S=1|w_1, ..., w_n) = \frac{p(S=1) \prod_{i=1}^n p(w_i|S=1)}{\sum_{s\in\{0, 1\}}p(S=s)\prod_{i=1}^n p(w_i|S=s)}
$$
Let me instantiate that formula with a simple NLP task below.
Suppose we want to detect whether an incoming email is spam ($S=1$) or not spam ($S=0$), our vocabulary has 5,000 words ($n = 5000$), and for simplicity the only thing we care about is whether a word $w_i$ occurs in the email ($p(w_i|S=1)$) or not ($1-p(w_i|S=1)$), i.e. Bernoulli naive Bayes.
In [1]: import numpy as np
In [2]: from sklearn.naive_bayes import BernoulliNB
# let's train our model with 200 samples
In [3]: X = np.random.randint(2, size=(200, 5000))
In [4]: y = np.random.randint(2, size=(200, 1)).ravel()
In [5]: clf = BernoulliNB()
In [6]: model = clf.fit(X, y)
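Before going on, note what the fitted model exposes (standard sklearn attributes, already stored in log space; the shapes below are what we expect for 2 classes and 5,000 features):
model.feature_log_prob_.shape   # (2, 5000): log p(w_i=1 | S=s) for each class s
model.class_log_prior_.shape    # (2,): log p(S=0) and log p(S=1)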
We can see that $p(S=s)\prod_{i=1}^n p(w_i|S=s)$ becomes extremely small: every factor in $\prod_{i=1}^{5000}$ (both $p(w_i|S=1)$ and $1-p(w_i|S=1)$) lies between 0 and 1, so the product quickly drops below 5e-324, both the numerator and the denominator underflow to 0, and we end up computing $0/0$.
In [7]: (np.nextafter(0, 1)*2) / (np.nextafter(0, 1)*2)
Out[7]: 1.0
In [8]: (np.nextafter(0, 1)/2) / (np.nextafter(0, 1)/2)
/home/lerner/anaconda3/bin/ipython3:1: RuntimeWarning: invalid value encountered in double_scalars
#!/home/lerner/anaconda3/bin/python
Out[8]: nan
In [9]: l_cpt = model.feature_log_prob_
In [10]: x = np.random.randint(2, size=(1, 5000))
In [11]: cls_lp = model.class_log_prior_
In [12]: probs = np.where(x, np.exp(l_cpt[1]), 1-np.exp(l_cpt[1]))
In [13]: np.exp(cls_lp[1]) * np.prod(probs)
Out[13]: 0.0
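To see how far below 5e-324 that product really is, we can measure it in log space (a quick check of my own, reusing l_cpt, cls_lp and x from above); this also hints at the fix used later:
log_probs = np.where(x, l_cpt[1], np.log(1 - np.exp(l_cpt[1])))  # log p(w_i|S=1) or log(1-p(w_i|S=1))
cls_lp[1] + log_probs.sum()   # finite, roughly 5000*log(0.5), about -3466 for this random data
np.log(np.nextafter(0, 1))    # about -744.4, the log of the smallest representable float
So the true joint probability is around $e^{-3466}$, far smaller than 5e-324, which is why the direct product above gives 0.0.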
Then the problem arises: how can we calculate the probability that the email is spam, $p(S=1|w_1, ... w_n)$? Or, concretely, how can we calculate the numerator and the denominator?
We can see the official implementation in sklearn:
jll = self._joint_log_likelihood(X)
# normalize by P(x) = P(f_1, ..., f_n)
log_prob_x = logsumexp(jll, axis=1)
return jll - np.atleast_2d(log_prob_x).T
For the numerator it converts the product of probabilities into a sum of log likelihoods, and for the denominator it uses logsumexp from scipy, which is:
out = log(sum(exp(a - a_max), axis=0))
out += a_max
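As a small illustration of my own (the numbers are made up just for the demo), the naive way of summing the probabilities underflows, while the shifted version stays finite:
from scipy.special import logsumexp
a = np.array([-1000.0, -1001.0])  # two joint log likelihoods, both far below log(5e-324), about -744
np.log(np.exp(a).sum())           # -inf, because exp(-1000) underflows to 0
logsumexp(a)                      # about -999.69, never leaves the representable range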
The point is that we cannot add two joint probabilities by adding their joint log likelihoods, so we have to leave log space and go back to probability space. But the true probabilities are too small to add directly, so we first scale them and do the addition, $\sum_{s\in\{0,1\}} e^{jll_s - max\_jll}$, then take the result back into log space, $\log\sum_{s\in\{0,1\}} e^{jll_s - max\_jll}$, and finally rescale it in log space by adding back $max\_jll$: $max\_jll + \log\sum_{s\in\{0,1\}} e^{jll_s - max\_jll}$.
And here is the derivation:
$\begin{align}
\log \sum_{s\in\{0,1\}} e^{jll_s} & =
\log \sum_{s\in\{0,1\}} e^{jll_s}e^{max\_jll - max\_jll} \\& =
\log \left(e^{max\_jll}\sum_{s\in\{0,1\}} e^{jll_s - max\_jll}\right) \\& =
\log e^{max\_jll} + \log\sum_{s\in\{0,1\}} e^{jll_s - max\_jll} \\& =
max\_jll + \log\sum_{s\in\{0,1\}} e^{jll_s - max\_jll}
\end{align}$
where $max\_jll$ is the $a\_max$ in the code.
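Written out in NumPy, the derivation is just a few lines; this is a from-scratch sketch of the same trick (the name my_logsumexp is mine, it is not sklearn's or scipy's code):
def my_logsumexp(jll):
    # jll: array of joint log likelihoods, one entry per class
    max_jll = jll.max()                               # the a_max in the scipy code above
    return max_jll + np.log(np.exp(jll - max_jll).sum())
my_logsumexp(np.array([-1000.0, -1001.0]))            # matches scipy's logsumexp on the same input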
Once we have both the numerator and the denominator in log space, we can get the log conditional probability ($\log p(S=1|w_1, ... w_n)$) by subtracting the denominator from the numerator:
return jll - np.atleast_2d(log_prob_x).T
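Putting everything together, here is a rough end-to-end sketch of my own (a re-derivation, not sklearn's exact code) that reproduces the prediction for the email x entirely in log space:
from scipy.special import logsumexp
log_p = model.feature_log_prob_                  # log p(w_i=1 | S=s)
log_1mp = np.log(1 - np.exp(log_p))              # log p(w_i=0 | S=s)
# numerator for each class s: log p(S=s) + sum_i log p(w_i | S=s)
jll = model.class_log_prior_ + np.where(x, log_p, log_1mp).sum(axis=1)
# denominator: log p(w_1, ..., w_n) via logsumexp
log_prob_x = logsumexp(jll)
np.exp(jll - log_prob_x)                         # should match model.predict_proba(x)[0]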
Hope that helps.
References:
1. Bernoulli Naive Bayes Classifier
2. Spam Filtering with Naive Bayes – Which Naive Bayes?