I'm trying to better understand Laplace (+1) smoothing for Naive Bayes text classification. Using the naiveBayes() function from the e1071 package in R, I get some confusing results.
If I fit a model with laplace = 1 (i.e., smoothing on), the predicted probabilities are identical for test data that contains no words unseen in training and for test data that does. I was under the impression that by adding +1 to all counts, some probability mass is "stolen" and given to the unknown features. So shouldn't the probabilities be at least a little different?
(I can turn laplace way up and it doesn't seem to do anything. I've also tried more complicated data, with 20 or so features counted for training and 3 more added at test time, and nothing changes: the probabilities are still identical for test data with unseen words versus test data without them.)
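For reference, this is the behaviour I was expecting, computed by hand for a toy multinomial model (just my own sketch of the textbook formula, not necessarily what e1071 does internally):

# Hand-computed Laplace smoothing for one class of a multinomial model.
# 'counts' are the training counts of words a, b, c for that class.
counts  <- c(a = 3, b = 7, c = 2)
laplace <- 1
V       <- length(counts) + 1   # vocabulary size if we also allow one unseen word 'd'

# Without smoothing, the unseen word 'd' gets probability 0 and zeroes out the product.
p_unsmoothed <- counts / sum(counts)

# With smoothing, every word gets (count + laplace) / (N + laplace * V),
# so 'd' receives a small nonzero probability and the seen words shrink slightly.
p_smoothed <- (c(counts, d = 0) + laplace) / (sum(counts) + laplace * V)

p_unsmoothed
p_smoothed   # p_smoothed["d"] > 0, which is the effect I expected to see

And here is the minimal e1071 example where that expectation doesn't seem to hold: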
library(e1071)
# Training
train_x <- data.frame(
  a = rpois(10, 2),
  b = rpois(10, 4),
  c = rpois(10, 1)
)
train_y <- round(runif(10))
model <- naiveBayes(x = train_x, y = train_y, laplace = 1)
# Scoring, all words are known
test_x <- data.frame(
  a = 1,
  b = 5,
  c = 4
)
predict(model, newdata = test_x, type = "raw")
                0         1
[1,] 1.297711e-06 0.9999987
# Scoring, unknown word 'd'
test_xnew <- data.frame(
  a = 1,
  b = 5,
  c = 4,
  d = 2
)
predict(model, newdata = test_xnew, type = "raw")
                0         1
[1,] 1.297711e-06 0.9999987
Even more confusing: if I train the model with Laplace off (laplace = 0) and then predict on both test sets, the one without unseen words and the one with them, I get identical results for all four cases now, including the two above. R didn't even throw an error that the test data passed to predict() contained columns/features that weren't seen during training, even with Laplace off.
> model <- naiveBayes(x = train_x, y = train_y, laplace = 0)
> test_x <- data.frame(
+ a = 1,
+ b = 5,
+ c = 4
+ )
> predict(model, newdata = test_x, type = "raw")
                0         1
[1,] 1.297711e-06 0.9999987
> test_xnew <- data.frame(
+ a = 1,
+ b = 5,
+ c = 4,
+ d = 2
+ )
> predict(model, newdata = test_xnew, type = "raw")
                0         1
[1,] 1.297711e-06 0.9999987
Thus, regardless of the setup (unseen words or not, and any value of laplace, including none), my predicted probabilities are always identical.
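To double-check that, I also compared the fitted models directly (apriori and tables are components of the object returned by naiveBayes(), per ?naiveBayes):

# Sanity check: fit the same data with wildly different laplace values and
# compare the fitted pieces of the model object.
m0    <- naiveBayes(x = train_x, y = train_y, laplace = 0)
m1000 <- naiveBayes(x = train_x, y = train_y, laplace = 1000)

identical(m0$apriori, m1000$apriori)
identical(m0$tables,  m1000$tables)   # both TRUE in my runs; laplace changes nothing here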
What am I doing wrong? Maybe the smoothing effect on the new word d is simply so tiny that I can't see it at this number of decimal places? I don't think so, though. I also tried the HouseVotes84 example from https://rdrr.io/cran/e1071/man/naiveBayes.html, and there the results do differ depending on laplace, so it is possible to have a situation where it has an effect.
So... this leads me to think it's something about feature complexity, that my example is too contrived? But if I try 20 features generated the same way for training, plus 3 new features for the test data (and one of the training "words" with its count set to 0 for all "docs"), I get the same identical results for all permutations.
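For what it's worth, that bigger test looked roughly like this (a reconstructed sketch, not my exact code):

# Roughly the bigger test: 20 Poisson "word count" columns for training,
# one of them forced to 0 for every "doc", and 3 extra columns that only
# appear at prediction time.
train_big <- as.data.frame(matrix(rpois(10 * 20, 2), nrow = 10))
names(train_big) <- paste0("w", 1:20)
train_big$w20 <- 0   # one training "word" never observed

model_big <- naiveBayes(x = train_big, y = train_y, laplace = 1)

test_big <- train_big[1, ]
test_big$new1 <- 3   # unseen "words" added only for scoring
test_big$new2 <- 1
test_big$new3 <- 2

predict(model_big, newdata = test_big, type = "raw")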
edit: shoot, I'm sorry, just found this: https://stackoverflow.com/questions/36875905/naivebayes-giving-unexpected-result-when-using-nonzero-laplace-argument-package
-So-
Does this mean that I can't use this package's function with laplace on document-term matrices? That's what I was trying to do. The Stack Overflow question I linked notes that laplace only works for categorical features. Does that mean "ever", not just in e1071's implementation of Naive Bayes? Because that can't be right. How would you ever do Laplace smoothing on a document-term matrix? I just want a way to use Laplace smoothing for unseen words when predicting on document-term-matrix features in R.
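To make it concrete, this is the kind of computation I'm after: a hand-rolled multinomial Naive Bayes over a tiny document-term matrix, with Laplace smoothing so that an out-of-vocabulary word still gets nonzero probability. This is just my own sketch; the toy data and the ".unseen" slot are made up for illustration, and I'd much rather use a package function that handles it.

# Hand-rolled multinomial NB with Laplace smoothing over a toy document-term
# matrix (rows = documents, columns = word counts).
dtm <- matrix(c(2, 0, 1,
                0, 3, 1,
                1, 1, 0,
                0, 2, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("apple", "banana", "cherry")))
y <- factor(c("pos", "pos", "neg", "neg"))

laplace <- 1
vocab   <- c(colnames(dtm), ".unseen")   # reserve one slot for out-of-vocabulary words

# Smoothed per-class word probabilities: (count + laplace) / (N + laplace * |V|).
word_probs <- t(sapply(levels(y), function(cl) {
  counts <- c(colSums(dtm[y == cl, , drop = FALSE]), .unseen = 0)
  (counts + laplace) / (sum(counts) + laplace * length(vocab))
}))
priors <- table(y) / length(y)

# Score a new document; "durian" was never seen in training, so it is mapped
# to the '.unseen' slot instead of producing a zero probability.
new_doc <- c(apple = 1, cherry = 2, durian = 1)
mapped  <- ifelse(names(new_doc) %in% colnames(dtm), names(new_doc), ".unseen")
scores  <- sapply(levels(y), function(cl) {
  log(priors[[cl]]) + sum(new_doc * log(word_probs[cl, mapped]))
})
exp(scores - max(scores)) / sum(exp(scores - max(scores)))   # posterior class probabilities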
edit: this also seems highly relevant: https://stackoverflow.com/questions/21163207/document-term-matrix-for-naive-bayes-classfier-unexpected-results-r. I can't believe he is saying "convert the document-term matrix values to factors", but it sort of sounds like that.
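If I'm reading that answer right, the suggestion amounts to something like the following: treat each count column as categorical so that naiveBayes() builds a conditional probability table (which is where laplace applies) instead of fitting a Gaussian to the numeric values.

# What "convert the document-term matrix values to factors" would look like,
# if I understand the linked answer correctly.
train_x_factor <- as.data.frame(lapply(train_x, factor))
model_factor   <- naiveBayes(x = train_x_factor, y = train_y, laplace = 1)
model_factor$tables$a   # now a table of P(a = level | class), not mean/sd

That would make laplace do something, but treating counts as unordered categories feels like an odd fit for a document-term matrix.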