Size of training in Naive Bayes

Question

I just started getting involved with Machine Learning and I decided to create a spam filter for my social app, using the Naive Bayes classifier. I'm following this guide: https://hackernoon.com/how-to-build-a-simple-spam-detecting-machine-learning-classifier-4471fe6b816e

My app has ~70,000 posts and about 3,000 of them are marked as spam. How many of my non-spam posts should I use to train my model?

Possible duplicate of [Optimal case/control ratio in a case-control study](https://stats.stackexchange.com/questions/25740/optimal-case-control-ratio-in-a-case-control-study) — Sycorax, Aug 06 '19 at 17:09

score 1 · Accepted Answer · answered Aug 06 '19 at 14:10

In general, use stratified sampling to create the training/test split; otherwise the class proportions in the training set, and hence your estimated priors, will be biased. Naive Bayes estimates the class priors from the training data. Here the spam prior is about $3/70 \approx 0.04$; if you instead include equal numbers of spam and non-spam posts, the estimated prior becomes $\pi=0.5$, which can easily harm your predictions. A typical train/test split follows the 80/20 convention.
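To illustrate the effect on the prior, here is a minimal sketch using only the Python standard library and the approximate counts from the question (3,000 spam out of 70,000 posts). The helper names `stratified_split` and `class_prior` are hypothetical, made up for this example; in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would likely be used instead.

```python
import random


def stratified_split(labels, train_frac=0.8, seed=0):
    """Return (train_idx, test_idx), sampling train_frac of each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return train_idx, test_idx


def class_prior(labels, indices, positive=1):
    """Fraction of the selected samples that carry the positive label."""
    return sum(labels[i] == positive for i in indices) / len(indices)


# Approximate counts from the question: 70,000 posts, 3,000 of them spam (label 1).
labels = [1] * 3000 + [0] * 67000

# Stratified 80/20 split: the training set keeps the original spam proportion.
train_idx, test_idx = stratified_split(labels)
stratified_prior = class_prior(labels, train_idx)

# For contrast: a balanced training set (all spam plus equally many non-spam).
balanced_idx = list(range(3000)) + list(range(3000, 6000))
balanced_prior = class_prior(labels, balanced_idx)

print(len(train_idx), len(test_idx))               # 56000 14000
print(round(stratified_prior, 4), balanced_prior)  # 0.0429 0.5
```

With the stratified split the estimated spam prior stays at $3/70$, whereas the balanced subset pushes it to $0.5$.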

gunes


So I should include all of my spam and non-spam posts in the training process? – Sotiris Kaniras Aug 06 '19 at 14:16
No, set aside $80\%$ of your samples for training and $20\%$ as the test set, separately for spam and non-spam. – gunes Aug 06 '19 at 14:17
80% of my whole database, right? – Sotiris Kaniras Aug 06 '19 at 14:19
Yes, if you include 80% of your spam and 80% of your non-spam posts for training, the training set will be 80% of your whole dataset. – gunes Aug 06 '19 at 14:20
Thank you very much! – Sotiris Kaniras Aug 06 '19 at 14:21
