
I built an artificial neural network with a binary dependent variable called "Suspicious", so there are only two possible outcomes. I have 297,771 rows labeled "0" (not suspicious / known good) and only 1,100 rows labeled "1" (suspicious / bad). On the test set the confusion matrix looks like this:

cm
array([[59552,     0],
       [  148,    75]])

This gives me a test accuracy of 99.75240%, which seems way too high. Is there a rule of thumb for how many bad rows, or "1"s, I should have in the data before I run it through the model, like 1/3 or 1/2?
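
For reference, that accuracy comes directly from the confusion matrix above (correct predictions on the diagonal, divided by the total number of test rows):

$$ \frac{59552 + 75}{59552 + 0 + 148 + 75} = \frac{59627}{59775} \approx 0.9975240 $$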

sectechguy
  • Related: https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models – Sycorax Aug 17 '18 at 18:37

1 Answer


If I were a naive prediction model, then based on the overall distribution of the data I would simply guess "$0$" for every single outcome. This would give me an accuracy of:

$$ \frac{297771}{297771+1100} = 0.9963195 = 99.63195\% \text{ accuracy} $$

Based on this, your prediction model is only marginally better than a naive estimator that predicts $0$ for every output. Food for thought.
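
To make this concrete, here is a rough Python/NumPy sketch (using the confusion matrix you posted; the particular metrics shown are just illustrative) that computes the same naive baseline on your test set alongside imbalance-aware metrics such as recall and balanced accuracy:

import numpy as np

# Confusion matrix from the question: rows = actual class, columns = predicted class
#                 pred 0   pred 1
cm = np.array([[59552,      0],    # actual 0 (not suspicious)
               [  148,     75]])   # actual 1 (suspicious)

tn, fp = cm[0]
fn, tp = cm[1]

accuracy       = (tp + tn) / cm.sum()        # ~0.9975, the figure reported in the question
naive_accuracy = cm[0].sum() / cm.sum()      # ~0.9963, always predicting "0" on this test set
recall         = tp / (tp + fn)              # ~0.34, only about a third of suspicious rows are caught
precision      = tp / (tp + fp)              # 1.0, no false alarms -- but see recall
specificity    = tn / (tn + fp)              # 1.0
balanced_acc   = (recall + specificity) / 2  # ~0.67, accounts for the class imbalance

print(f"accuracy={accuracy:.4f}  naive={naive_accuracy:.4f}  "
      f"recall={recall:.4f}  precision={precision:.4f}  balanced={balanced_acc:.4f}")

Metrics like recall and balanced accuracy (or precision-recall curves) give a much better sense of how well the minority "suspicious" class is actually being detected than raw accuracy does.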

ERT