I was reading about Naive Bayes classification today. Under the heading "Parameter Estimation with Add-1 Smoothing", I read:
Let $c$ refer to a class (such as Positive or Negative), and let $w$ refer to a token or word.
The maximum likelihood estimator for $P(w|c)$ is $$\frac{\text{count}(w,c)}{\text{count}(c)} = \frac{\text{count of } w \text{ in class } c}{\text{count of words in class } c}.$$
This estimate of $P(w|c)$ is problematic because it assigns probability $0$ to any document containing an unknown word. A common way of solving this problem is to use Laplace smoothing.
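To see the problem concretely, here is a minimal Python sketch of the MLE estimate on a made-up two-document corpus (the corpus and the function name `p_mle` are just my own illustration, not from the text I was reading):

```python
from collections import Counter

# Toy training data: (class, document) pairs -- purely illustrative.
train = [
    ("Positive", "good great good"),
    ("Negative", "bad awful bad bad"),
]

# count(w, c) = occurrences of word w in class c; count(c) = total tokens in class c.
word_counts = {c: Counter() for c, _ in train}
for c, doc in train:
    word_counts[c].update(doc.split())

def p_mle(w, c):
    """Maximum likelihood estimate: count(w, c) / count(c)."""
    return word_counts[c][w] / sum(word_counts[c].values())

print(p_mle("good", "Positive"))      # 2/3
print(p_mle("terrible", "Positive"))  # 0.0 -- unseen word, so the product
                                      # over the whole document becomes 0
```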
Let $V$ be the set of words in the training set, and add a new element $UNK$ (for unknown) to it.
Define $$P(w|c)=\frac{\text{count}(w,c) +1}{\text{count}(c) + |V| + 1},$$
where $V$ refers to the vocabulary (the words in the training set).
In particular, any unknown word will have probability $$\frac{1}{\text{count}(c) + |V| + 1}.$$
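And here is the add-1 version of the same sketch, so the effect of the smoothing is visible on the same made-up corpus (again, the corpus and the function name `p_laplace` are my own toy illustration):

```python
from collections import Counter

# Same toy corpus as the sketch above.
train = [
    ("Positive", "good great good"),
    ("Negative", "bad awful bad bad"),
]
word_counts = {c: Counter() for c, _ in train}
for c, doc in train:
    word_counts[c].update(doc.split())

# Vocabulary V: all distinct words seen in training; the extra "+1" in the
# denominator is the slot reserved for UNK.
vocab = {w for _, doc in train for w in doc.split()}

def p_laplace(w, c):
    """Add-1 smoothed estimate: (count(w, c) + 1) / (count(c) + |V| + 1)."""
    count_c = sum(word_counts[c].values())
    return (word_counts[c][w] + 1) / (count_c + len(vocab) + 1)

print(p_laplace("good", "Positive"))      # (2+1)/(3+4+1) = 3/8
print(p_laplace("terrible", "Positive"))  # (0+1)/(3+4+1) = 1/8 -- small but nonzero
```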
My question is this: why do we bother with this Laplace smoothing at all? If the unknown words we encounter in the test set have a probability that is obviously almost zero, namely $\frac{1}{\text{count}(c) + |V| + 1}$, what is the point of including them in the model? Why not just disregard and delete them?