I was reading about Naive Bayes classification today. Under the heading "Parameter Estimation with Add-1 Smoothing", I read:
Let $c$ refer to a class (such as Positive or Negative), and let $w$ refer to a token or word.
The maximum likelihood estimator for $P(w|c)$ is $$\frac{\text{count}(w,c)}{\text{count}(c)} = \frac{\text{count of } w \text{ in class } c}{\text{count of words in class } c}.$$
This estimate of $P(w|c)$ is problematic because it assigns probability $0$ to any document containing an unknown word. A common way of solving this problem is to use Laplace smoothing.
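To see the problem concretely, here is a minimal Python sketch of the MLE estimate on a made-up two-document corpus (the corpus and the function name `p_mle` are just my own illustration, not from the text I was reading):

```python
from collections import Counter

# Toy training data: (class, document) pairs -- purely illustrative.
train = [
    ("Positive", "good great good"),
    ("Negative", "bad awful bad bad"),
]

# count(w, c) = occurrences of word w in class c; count(c) = total tokens in class c.
word_counts = {c: Counter() for c, _ in train}
for c, doc in train:
    word_counts[c].update(doc.split())

def p_mle(w, c):
    """Maximum likelihood estimate: count(w, c) / count(c)."""
    return word_counts[c][w] / sum(word_counts[c].values())

print(p_mle("good", "Positive"))      # 2/3
print(p_mle("terrible", "Positive"))  # 0.0 -- unseen word, so the product
                                      # over the whole document becomes 0
```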
Let $V$ be the set of words in the training set, and add a new element $UNK$ (for unknown) to it.
Define $$P(w|c)=\frac{\text{count}(w,c) +1}{\text{count}(c) + |V| + 1},$$
where $V$ refers to the vocabulary (the words in the training set).
In particular, any unknown word will have probability $$\frac{1}{\text{count}(c) + |V| + 1}.$$
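And here is the add-1 version of the same sketch, so the effect of the smoothing is visible on the same made-up corpus (again, the corpus and the function name `p_laplace` are my own toy illustration):

```python
from collections import Counter

# Same toy corpus as the sketch above.
train = [
    ("Positive", "good great good"),
    ("Negative", "bad awful bad bad"),
]
word_counts = {c: Counter() for c, _ in train}
for c, doc in train:
    word_counts[c].update(doc.split())

# Vocabulary V: all distinct words seen in training; the extra "+1" in the
# denominator is the slot reserved for UNK.
vocab = {w for _, doc in train for w in doc.split()}

def p_laplace(w, c):
    """Add-1 smoothed estimate: (count(w, c) + 1) / (count(c) + |V| + 1)."""
    count_c = sum(word_counts[c].values())
    return (word_counts[c][w] + 1) / (count_c + len(vocab) + 1)

print(p_laplace("good", "Positive"))      # (2+1)/(3+4+1) = 3/8
print(p_laplace("terrible", "Positive"))  # (0+1)/(3+4+1) = 1/8 -- small but nonzero
```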
My question is this: why do we bother with this Laplace smoothing at all? If the unknown words we encounter in the test set have a probability that is obviously almost zero, namely $\frac{1}{\text{count}(c) + |V| + 1}$, what is the point of including them in the model? Why not just disregard and delete them?