
Statistics.com published a problem of the week: The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent). A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud. The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as “fraud”). If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?

https://www.statistics.com/news/231/192/Conditional-Probability/?showtemplate=true

My peer and I both came up with the same answer independently and it doesn't match the published solution.

Our solution:

P(fraud | flagged) = (0.9 * 0.1) / ((0.9 * 0.1) + (0.2 * 0.9)) = 0.09 / 0.27 = 1/3
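
As a quick cross-check, a minimal Python sketch of the same Bayes' rule calculation:

```python
# P(fraud | flagged) via Bayes' rule
p_fraud = 0.10               # base rate: 10% of claims are fraudulent
p_flag_given_fraud = 0.90    # system flags 90% of fraudulent claims
p_flag_given_ok = 0.20       # system mistakenly flags 20% of legitimate claims

# Total probability that a claim gets flagged
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_ok * (1 - p_fraud)

# Posterior probability that a flagged claim is actually fraudulent
p_fraud_given_flag = (p_flag_given_fraud * p_fraud) / p_flag
print(p_fraud_given_flag)    # 0.3333... = 1/3
```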

Their solution:

This is a problem in conditional probability. (It’s also a Bayesian problem, but applying the formula in Bayes Rule only helps to obscure what’s going on.) Consider 100 claims. 10 will be fraudulent, and the system will correctly label 9 of them as “fraud.” 90 claims will be OK, but the system will incorrectly classify 72 (80%) as “fraud.” So a total of 81 claims have been labeled as fraudulent, but only 9 of them, 11%, are actually fraudulent.

Who was right?

Tim
ChrisG
  • looks like they corrected the solution on their website to be in line with what you calculated – nope Dec 18 '18 at 16:11
  • @nope, quietly corrected the answer. sneaky – Aksakal Dec 18 '18 at 16:26
  • Trivia: in behavioral decision-making, this problem is often referred to as the "mammogram problem", since its usual presentation is about the chance of a patient having cancer given a positive mammogram. – Kodiologist Dec 18 '18 at 19:40
  • "The good news is, our system classifies 90% of fraud as fraud. The bad news is, it classifies 80% of non-fraud as fraud." Note the the 11% they calculate is only slightly higher than the 10% base rate. A machine learning model where the fraud rate in the flagged cases is only 10% more than the base rate is quite terrible. – Acccumulation Dec 18 '18 at 20:49
  • This is known as the [false positive paradox](https://en.wikipedia.org/wiki/Base_rate_fallacy#False_positive_paradox) – BlueRaja - Danny Pflughoeft Dec 18 '18 at 23:18

2 Answers


I believe that you and your colleague are correct. Statistics.com has the correct line of thinking but makes a simple mistake: out of the 90 "OK" claims, we expect 20% of them to be incorrectly classified as fraud, not 80%. Since 20% of 90 is 18, that gives 9 correctly flagged claims and 18 incorrectly flagged claims, so the fraction of flagged claims that are truly fraudulent is 9 / (9 + 18) = 1/3, exactly what Bayes' rule yields.
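
The same counting argument written out as a short Python sketch, using the 100-claim cohort from the original solution:

```python
# Frequency-count version of the calculation, per 100 claims
total = 100
fraud = 0.10 * total          # 10 fraudulent claims
ok = total - fraud            # 90 legitimate claims

flagged_fraud = 0.90 * fraud  # 9 fraudulent claims correctly flagged
flagged_ok = 0.20 * ok        # 18 legitimate claims incorrectly flagged (not 72)

# Fraction of flagged claims that are truly fraudulent
print(flagged_fraud / (flagged_fraud + flagged_ok))  # 9 / 27 = 0.333...
```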

James Otto

You are correct. The solution the website posted is based on a misreading of the problem: it treats 80% of the nonfraudulent claims as being classified as fraudulent, rather than the given 20%.
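
A quick Monte Carlo sketch, assuming the intended 20% false-positive rate, lands on the same 1/3:

```python
import random

random.seed(0)
n = 1_000_000
flagged = 0
flagged_and_fraud = 0

for _ in range(n):
    is_fraud = random.random() < 0.10        # 10% base rate
    if is_fraud:
        is_flagged = random.random() < 0.90  # 90% detection rate
    else:
        is_flagged = random.random() < 0.20  # 20% false-positive rate
    if is_flagged:
        flagged += 1
        flagged_and_fraud += is_fraud

print(flagged_and_fraud / flagged)           # ~0.333
```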

Dilip Sarwate