
I am studying a few examples of simple Naive Bayes for spam detection. I have a question about it, but I am unable to find the answer in any of the examples.

I was wondering what happens if a word appears multiple times in the emails. For example, if we have a total of 4 spam emails and together they contain the word "Password" 8 times, what is P(Password|Spam)? According to the formula used in the examples, it would be 8/4 = 2, which is obviously not possible, since a probability can never be greater than 1. What am I missing? Please help.

Tim
Johnny

2 Answers


There are multiple ways to use text features in machine learning algorithms. You can simply encode whether a word occurred in the text (0 for no, 1 for yes), use bag-of-words (counts of occurrences), $n$-grams (sequences of consecutive words), TF-IDF scores, or Word2vec embeddings; you can also take the position of words in the text into account, and there are many other possible representations. The simplest approach is to 0-1 encode the occurrence of each word, in which case you are dealing only with binary features in your naive Bayes algorithm. With that encoding, P(Password|Spam) is the fraction of spam emails that contain the word at least once, so it cannot exceed 1. Which representation you choose depends on many factors: for example, with a huge dataset you may prefer a simpler method, whereas if you want to improve the performance of your algorithm you may consider something more sophisticated.
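To make the difference between the 0-1 encoding and the bag-of-words encoding concrete, here is a minimal Python sketch. The four spam emails are made-up toy data, chosen only so that "password" occurs 8 times in total as in the question; the function names `bernoulli_estimate` and `multinomial_estimate` are hypothetical, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical toy data: 4 spam emails, "password" occurs 8 times in total.
spam_emails = [
    "password password reset your password now",
    "verify password password",
    "your password expired password password",
    "win a free prize",
]


def bernoulli_estimate(word, emails):
    """0-1 encoding: fraction of emails containing the word at least once."""
    containing = sum(1 for email in emails if word in email.split())
    return containing / len(emails)


def multinomial_estimate(word, emails):
    """Bag-of-words: share of all word tokens in the emails that are this word."""
    counts = Counter(token for email in emails for token in email.split())
    return counts[word] / sum(counts.values())


p_binary = bernoulli_estimate("password", spam_emails)    # 3 of 4 emails
p_counts = multinomial_estimate("password", spam_emails)  # 8 of 18 tokens
print(p_binary, p_counts)
```

In both cases the denominator counts the same kind of thing as the numerator (emails over emails, or tokens over tokens), so the estimate stays between 0 and 1. Dividing a token count by an email count, as in 8/4, mixes the two.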

Tim

Could you say where you are getting the examples and formulas from?

Here is an excellent question and answer on Naive Bayes: Understanding Naive Bayes

When using Naive Bayes you usually do not compute a normalized probability distribution, but rather a score that is proportional to it, which is enough to rank the classes.

Taken from the post referenced above:

Bayes Theorem:

$P(class \mid features) = \frac{P(features \mid class) \cdot P(class)}{P(features)}$

In Naive Bayes, we don't divide by $P(features)$, giving us:

$P(class \mid features) \propto P(features \mid class) \cdot P(class)$
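As a rough illustration of the proportionality above, the following Python sketch computes the unnormalized scores and then recovers proper probabilities by dividing by their sum over the classes. All numbers, the class label "ham" for non-spam, and the function name `posterior` are assumptions made up for this example.

```python
def posterior(likelihoods, priors):
    """Return unnormalized scores P(features|class) * P(class) and their normalized version."""
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(scores.values())  # plays the role of P(features)
    probs = {c: s / evidence for c, s in scores.items()}
    return scores, probs


# Hypothetical values for a single observed feature.
likelihoods = {"spam": 0.75, "ham": 0.10}  # P(features | class)
priors = {"spam": 0.4, "ham": 0.6}         # P(class)

scores, probs = posterior(likelihoods, priors)
print(scores)
print(probs)
```

The class with the highest score is the same before and after normalization, which is why the division by $P(features)$ can be skipped when only the predicted class is needed.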

If you share the data and formulas you are working with, I can try to explain it with reference to your problem in particular.