I've just started learning machine learning and decided to build a spam filter for my social app using a Naive Bayes classifier. I'm following this guide: https://hackernoon.com/how-to-build-a-simple-spam-detecting-machine-learning-classifier-4471fe6b816e

My app has ~70,000 posts and about 3,000 of them are marked as spam. How many of my non-spam posts should I use to train my model?

  • Possible duplicate of [Optimal case/control ratio in a case-control study](https://stats.stackexchange.com/questions/25740/optimal-case-control-ratio-in-a-case-control-study) – Sycorax Aug 06 '19 at 17:09

1 Answer

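In general, you should use stratified sampling to create the training/test split, so that each part keeps the class proportions of the full data set; otherwise your priors will be biased. Specifically, Naive Bayes estimates the class priors from the training data. If the true spam prior is $3/70 \approx 0.043$ and you choose to include spam and non-spam posts in equal numbers, your prior estimate will be $\pi=0.5$, which can easily harm your predictions. So use all of your non-spam posts rather than a subsample, and split them the same way as the spam posts. A typical train/test split follows the 80/20 convention.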
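To make the effect on the prior concrete, here is a minimal sketch in plain Python. It assumes the counts from the question (3,000 spam posts out of 70,000) and uses placeholder labels instead of real posts; the helper names `stratified_split` and `spam_prior` are made up for this illustration and do not come from the linked guide.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

def spam_prior(labels, indices):
    """Fraction of the selected posts that are labelled spam (label 1)."""
    return sum(labels[i] for i in indices) / len(indices)

# Placeholder labels matching the question: 1 = spam, 0 = non-spam.
labels = [1] * 3000 + [0] * 67000

# Stratified 80/20 split: the training prior stays at 3/70.
train_idx, test_idx = stratified_split(labels)
prior_stratified = spam_prior(labels, train_idx)

# Balanced alternative: keep only as many non-spam posts as spam posts.
spam_idx = [i for i in train_idx if labels[i] == 1]
ham_idx = [i for i in train_idx if labels[i] == 0]
balanced_idx = spam_idx + random.Random(0).sample(ham_idx, len(spam_idx))
prior_balanced = spam_prior(labels, balanced_idx)

print(f"stratified prior: {prior_stratified:.4f}")
print(f"balanced prior:   {prior_balanced:.4f}")
```

If you are already using scikit-learn, `train_test_split` accepts a `stratify` argument that does the same kind of split, so you would not need to write this helper yourself.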

gunes