I'm trying to better understand Laplace (+1) smoothing for Naive Bayes text classification. Using the naiveBayes() function from the e1071 package in R, I get some confusing results.
If I fit a model with laplace = 1 (i.e., smoothing on), the predicted probabilities are identical for test data that contains no words unseen in training and for test data that does. I was under the impression that by adding +1 to all counts, some probability mass is "stolen" and given to the unknown features. So shouldn't the probabilities be at least a little different?
(I can turn laplace way up and it doesn't seem to do anything. I've also tried more complicated data, with 20 or so features counted for training and 3 more added at test time, and nothing changes: the probabilities are still identical for test data with unseen words versus test data without them.)
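For reference, this is the behaviour I was expecting, computed by hand for a toy multinomial model (just my own sketch of the textbook formula, not necessarily what e1071 does internally):

# Hand-computed Laplace smoothing for one class of a multinomial model.
# 'counts' are the training counts of words a, b, c for that class.
counts  <- c(a = 3, b = 7, c = 2)
laplace <- 1
V       <- length(counts) + 1   # vocabulary size if we also allow one unseen word 'd'

# Without smoothing, the unseen word 'd' gets probability 0 and zeroes out the product.
p_unsmoothed <- counts / sum(counts)

# With smoothing, every word gets (count + laplace) / (N + laplace * V),
# so 'd' receives a small nonzero probability and the seen words shrink slightly.
p_smoothed <- (c(counts, d = 0) + laplace) / (sum(counts) + laplace * V)

p_unsmoothed
p_smoothed   # p_smoothed["d"] > 0, which is the effect I expected to see

And here is the minimal e1071 example where that expectation doesn't seem to hold: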
library(e1071)
# Training
train_x <- data.frame(
  a = rpois(10, 2),
  b = rpois(10, 4),
  c = rpois(10, 1)
)
train_y <- round(runif(10))
model <- naiveBayes(x = train_x, y = train_y, laplace = 1)
# Scoring, all words are known
test_x <- data.frame(
  a = 1,
  b = 5,
  c = 4
)
predict(model, newdata = test_x, type = "raw")
                0         1
[1,] 1.297711e-06 0.9999987
# Scoring, unknown word 'd'
test_xnew <- data.frame(
  a = 1,
  b = 5,
  c = 4,
  d = 2
)
predict(model, newdata = test_xnew, type = "raw")
                0         1
[1,] 1.297711e-06 0.9999987
Even more confusing: if I train the model with Laplace off (laplace = 0) and then predict on both test sets, the one without unseen words and the one with them, I get identical results for all four cases now, including the two above. R didn't even throw an error that the test data passed to predict() contained columns/features that weren't seen during training, even with Laplace off.
> model <- naiveBayes(x = train_x, y = train_y, laplace = 0)
> test_x <- data.frame(
+ a = 1,
+ b = 5,
+ c = 4
+ )
> predict(model, newdata = test_x, type = "raw")
                0         1
[1,] 1.297711e-06 0.9999987
> test_xnew <- data.frame(
+ a = 1,
+ b = 5,
+ c = 4,
+ d = 2
+ )
> predict(model, newdata = test_xnew, type = "raw")
                0         1
[1,] 1.297711e-06 0.9999987
Thus, regardless of the setup (unseen words or not, and any value of laplace, including none), my predicted probabilities are always identical.
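To double-check that, I also compared the fitted models directly (apriori and tables are components of the object returned by naiveBayes(), per ?naiveBayes):

# Sanity check: fit the same data with wildly different laplace values and
# compare the fitted pieces of the model object.
m0    <- naiveBayes(x = train_x, y = train_y, laplace = 0)
m1000 <- naiveBayes(x = train_x, y = train_y, laplace = 1000)

identical(m0$apriori, m1000$apriori)
identical(m0$tables,  m1000$tables)   # both TRUE in my runs; laplace changes nothing here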
What am I doing wrong? Maybe the smoothing effect on the new word d is simply so tiny that I can't see it at this number of decimal places? I don't think so, though. I also tried the HouseVotes84 example from https://rdrr.io/cran/e1071/man/naiveBayes.html, and there the results do differ depending on laplace, so it is possible to have a situation where it has an effect.
So... this leads me to think it's something about feature complexity, that my example is too contrived? But if I try 20 features generated the same way for training, plus 3 new features for the test data (and one of the training "words" with its count set to 0 for all "docs"), I get the same identical results for all permutations.
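For what it's worth, that bigger test looked roughly like this (a reconstructed sketch, not my exact code):

# Roughly the bigger test: 20 Poisson "word count" columns for training,
# one of them forced to 0 for every "doc", and 3 extra columns that only
# appear at prediction time.
train_big <- as.data.frame(matrix(rpois(10 * 20, 2), nrow = 10))
names(train_big) <- paste0("w", 1:20)
train_big$w20 <- 0   # one training "word" never observed

model_big <- naiveBayes(x = train_big, y = train_y, laplace = 1)

test_big <- train_big[1, ]
test_big$new1 <- 3   # unseen "words" added only for scoring
test_big$new2 <- 1
test_big$new3 <- 2

predict(model_big, newdata = test_big, type = "raw")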
edit: shoot, I'm sorry, just found this: https://stackoverflow.com/questions/36875905/naivebayes-giving-unexpected-result-when-using-nonzero-laplace-argument-package
-So-
Does this mean that I can't use this package's function with laplace on document-term matrices? That's what I was trying to do. The Stack Overflow question I linked notes that laplace only works for categorical features. Does that mean "ever", not just in e1071's implementation of Naive Bayes? Because that can't be right. How would you ever do Laplace smoothing on a document-term matrix? I just want a way to use Laplace smoothing for unseen words when predicting on document-term-matrix features in R.
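To make it concrete, this is the kind of computation I'm after: a hand-rolled multinomial Naive Bayes over a tiny document-term matrix, with Laplace smoothing so that an out-of-vocabulary word still gets nonzero probability. This is just my own sketch; the toy data and the ".unseen" slot are made up for illustration, and I'd much rather use a package function that handles it.

# Hand-rolled multinomial NB with Laplace smoothing over a toy document-term
# matrix (rows = documents, columns = word counts).
dtm <- matrix(c(2, 0, 1,
                0, 3, 1,
                1, 1, 0,
                0, 2, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("apple", "banana", "cherry")))
y <- factor(c("pos", "pos", "neg", "neg"))

laplace <- 1
vocab   <- c(colnames(dtm), ".unseen")   # reserve one slot for out-of-vocabulary words

# Smoothed per-class word probabilities: (count + laplace) / (N + laplace * |V|).
word_probs <- t(sapply(levels(y), function(cl) {
  counts <- c(colSums(dtm[y == cl, , drop = FALSE]), .unseen = 0)
  (counts + laplace) / (sum(counts) + laplace * length(vocab))
}))
priors <- table(y) / length(y)

# Score a new document; "durian" was never seen in training, so it is mapped
# to the '.unseen' slot instead of producing a zero probability.
new_doc <- c(apple = 1, cherry = 2, durian = 1)
mapped  <- ifelse(names(new_doc) %in% colnames(dtm), names(new_doc), ".unseen")
scores  <- sapply(levels(y), function(cl) {
  log(priors[[cl]]) + sum(new_doc * log(word_probs[cl, mapped]))
})
exp(scores - max(scores)) / sum(exp(scores - max(scores)))   # posterior class probabilities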
edit: this also seems highly relevant: https://stackoverflow.com/questions/21163207/document-term-matrix-for-naive-bayes-classfier-unexpected-results-r. I can't believe he is saying "convert the document-term matrix values to factors", but it sort of sounds like that.
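If I'm reading that answer right, the suggestion amounts to something like the following: treat each count column as categorical so that naiveBayes() builds a conditional probability table (which is where laplace applies) instead of fitting a Gaussian to the numeric values.

# What "convert the document-term matrix values to factors" would look like,
# if I understand the linked answer correctly.
train_x_factor <- as.data.frame(lapply(train_x, factor))
model_factor   <- naiveBayes(x = train_x_factor, y = train_y, laplace = 1)
model_factor$tables$a   # now a table of P(a = level | class), not mean/sd

That would make laplace do something, but treating counts as unordered categories feels like an odd fit for a document-term matrix.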