The `NaiveBayes()` function in the **klaR** package uses the classical R formula interface, whereby you express your outcome as a function of its predictors, e.g. `spam ~ x1 + x2 + x3`. If your data are stored in a `data.frame`, you can include all predictors on the right-hand side of the formula using dot notation: `NaiveBayes(spam ~ ., data = df)` means "`spam` as a function of all other variables present in the `data.frame` called `df`."
Here is a toy example using the `spam` dataset discussed in *The Elements of Statistical Learning* (Hastie et al., Springer, 2nd ed., 2009), which is available online. This is only meant to get you started with the R function, not with the methodological aspects of using an NB classifier.
```r
data(spam, package = "ElemStatLearn")
library(klaR)

# set up a training sample (2/3 of the observations)
set.seed(101)  # for reproducibility
train.ind <- sample(1:nrow(spam), ceiling(nrow(spam) * 2/3), replace = FALSE)

# fit the NB classifier on the training sample
nb.res <- NaiveBayes(spam ~ ., data = spam[train.ind, ])

# show the estimated class-conditional densities, 8 predictors per page
opar <- par(mfrow = c(2, 4))
plot(nb.res)
par(opar)

# predict on holdout units
nb.pred <- predict(nb.res, spam[-train.ind, ])

# raw accuracy from the confusion matrix
confusion.mat <- table(nb.pred$class, spam[-train.ind, "spam"])
sum(diag(confusion.mat)) / sum(confusion.mat)
```
A recommended add-on package for this kind of ML task is the **caret** package. It offers many useful tools for preprocessing data, handling training/test samples, running different classifiers on the same data, and summarizing the results. It is available from CRAN and comes with a lot of vignettes describing common tasks.
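As a rough sketch of how the same task could look in caret (assuming caret and klaR are installed; the seed and the 5-fold cross-validation setup are arbitrary choices, not requirements):

```r
library(caret)
data(spam, package = "ElemStatLearn")

# stratified 2/3 training split (seed chosen arbitrarily)
set.seed(101)
idx <- createDataPartition(spam$spam, p = 2/3, list = FALSE)

# method = "nb" wraps klaR's NaiveBayes; tuning is assessed by 5-fold CV
fit <- train(spam ~ ., data = spam[idx, ], method = "nb",
             trControl = trainControl(method = "cv", number = 5))

# evaluate on the holdout sample
pred <- predict(fit, spam[-idx, ])
confusionMatrix(pred, spam[-idx, "spam"])
```

`confusionMatrix()` reports accuracy alongside sensitivity, specificity, and kappa, which saves you from computing them by hand as in the example above.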