Let's say we're building a spam classifier. When we feed it an email, it classifies it correctly as spam or not-spam 98% of the time. Then we discover that 99% of the email we receive is actually spam. Suddenly the 98% accuracy doesn't look so good anymore.
"On skewed datasets (e.g., when there are more positive examples than negative examples), accuracy is not a good measure of performance and you should instead use the F score, which is based on precision and recall." (Week 6 of Andrew Ng's Machine Learning on Coursera)
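For reference, the formula the course is referring to is (I assume) the usual F1 score, the harmonic mean of precision $P$ and recall $R$:

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$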
I can see the value of looking at precision and recall separately, but I don't see the value of combining them with a non-intuitive formula to replace accuracy as our primary overall measure of how the learning algorithm is performing. If 99% is the accuracy that a simple "always guess yes" algorithm can achieve, can't we just use that as a baseline and say that our algorithm is doing well when it achieves considerably better accuracy (99.76%, for instance)?
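To make the numbers concrete, here is a quick sketch of what that baseline scores (the numbers are assumed: 1000 emails, 99% spam, and the rare not-spam class treated as the positive class, which is how the course labels its skewed-class example, if I remember right):

```python
# Quick sketch: 1000 emails, 990 spam, 10 not-spam.
# Label convention assumed here: y = 1 for the rare class (not-spam),
# y = 0 for spam -- so "always guess yes (spam)" predicts 0 every time.

def scores(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = [0] * 990 + [1] * 10      # 99% of the mail is spam
always_spam = [0] * 1000           # the trivial "always guess yes" baseline

print(scores(y_true, always_spam))
# -> (0.99, 0.0, 0.0, 0.0): 99% accuracy, but precision, recall and F1 are all 0
```

So the F score clearly flags the degenerate baseline, but simply comparing accuracy against the baseline's 99% would flag it too, which is what prompts my question.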
Accuracy has the obvious benefit of being simple and intuitive. What advantages does the F-score hold over accuracy in our spam-classifier example?