1

I work at a motor insurance company and want to build a business rule that detects fraud cases based on the damage value. I have a historical data set that contains accident information, damage values, and fraud status (fraud or not) for each accident.

My goal is to identify the best threshold on the damage value so that my business rule can detect fraud cases with an acceptable false-positive (FP) rate (damage values in fraud cases tend to be high).

Any ideas on how to calculate the best threshold?

Analyst
  • In that case, you should build a classification model. First, consider which variables might influence whether a particular case was fraud or not. If you have a big data set, you can start with the full set of variables. You should build models from a few different classes, for instance logistic regression, a decision tree, or an SVM (it is hard to say which model will be better based on your description). Then you can decide which model classifies best using cross-validation. After that, you can use the best model to predict new accidents. – Jakub Sep 08 '20 at 09:41
  • This question doesn’t quite make sense to me. Do you have some kind of machine learning model that outputs a probability of fraud, and you want to find a threshold to discretize the probabilities into “fraud” and “not fraud”? That might not even be the way to go: https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email. But you’re also talking about the threshold of the damage value, not of a probability. Please elaborate on what you’re doing. – Dave Sep 08 '20 at 10:31
  • Thanks for your response, and sorry if my question is not clear. Simply, I need to select the accidents that have damage values greater than **(threshold)** and send them to the anti-fraud team to complete the investigation. So how can I find the best threshold that would indicate suspicious cases? – Analyst Sep 09 '20 at 09:19

2 Answers

1

This is entirely domain-specific: there is no single "best threshold" that applies to every scenario. You can usually move a classification threshold to gain sensitivity at the cost of specificity, and vice versa, so understanding that tradeoff for your particular classifier is a good place to start. Where you ultimately draw the threshold depends very much on your use case and the relative "cost" of false positives vs. false negatives. It sounds like the cost of false negatives is high (not detecting real fraud is a big problem for customers) but that false positives are not so costly (reviewing a no-fraud case just costs the company some time but does not impact customers).

You will need to quantify the relative cost of these different misclassifications and combine it with the sensitivity/specificity characteristics of your classifier to identify a threshold that has the highest overall utility with respect to misclassification cost. For example, if a false negative is twice as costly as a false positive, you may set the threshold one way, but if it's 1000 times as costly, you'd be better served by lowering the threshold and sending more cases for fraud review. In the limit where false negatives are catastrophic, you can't afford to miss any real fraud cases, and are forced to lower the threshold all the way and send everything to fraud review.
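As a rough illustration of the cost-weighted choice described above, here is a minimal sketch that scans candidate damage-value thresholds on labelled historical data and picks the one with the lowest total misclassification cost. The function name `best_threshold`, the parameters `cost_fp` and `cost_fn`, and the toy data are all assumptions made for this example, not anything taken from the question.

```python
def best_threshold(damages, is_fraud, cost_fp=1.0, cost_fn=10.0):
    """Return (threshold, total_cost) with the lowest misclassification cost.

    A claim is sent for review when its damage value is strictly greater
    than the threshold. cost_fp and cost_fn are hypothetical per-case costs
    that the business would have to estimate.
    """
    # Candidates: one value below every claim (flag everything),
    # plus each distinct observed damage value.
    candidates = [min(damages) - 1.0] + sorted(set(damages))
    best = None
    for t in candidates:
        # False positive: flagged for review but not fraud.
        fp = sum(1 for d, f in zip(damages, is_fraud) if d > t and not f)
        # False negative: real fraud that was not flagged.
        fn = sum(1 for d, f in zip(damages, is_fraud) if d <= t and f)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


# Toy historical data (invented for illustration only).
damages = [500, 800, 1000, 1200, 1500, 3000, 4000, 5000, 7000, 9000]
is_fraud = [0, 0, 1, 0, 0, 0, 1, 0, 1, 1]

# A missed fraud costs twice as much as a needless review.
print(best_threshold(damages, is_fraud, cost_fp=1, cost_fn=2))
# A missed fraud costs 1000 times as much: the threshold drops.
print(best_threshold(damages, is_fraud, cost_fp=1, cost_fn=1000))
```

On this toy data the chosen threshold moves from 3000 down to 800 as the false-negative cost grows, which is the behaviour described above. In practice the threshold should be validated on held-out data rather than on the same records used to choose it.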

Nuclear Hoagie
0

If you have a labelled dataset, you should build a supervised machine learning (ML) model so that you can generalise from the information in the dataset and make predictions about new incidents.

If you can get incident information for each incident, you can use it to train an ML model. You could start with a simple baseline such as Linear Discriminant Analysis or, even simpler, Naive Bayes with a Bernoulli or categorical distribution for the likelihood. You can then increase the complexity of the model if the accuracy is too low for your use case.
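To make the Naive Bayes baseline concrete, here is a small pure-Python sketch of a categorical Naive Bayes classifier with Laplace smoothing. It is written from scratch for illustration rather than taken from any library, and the feature values ("high"/"low" damage, "night"/"day") and function names are hypothetical stand-ins for whatever incident information is actually available.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels, alpha=1.0):
    """Fit a categorical Naive Bayes model; alpha is the Laplace smoothing."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, label)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    # feature_values[feature_index] = set of values seen during training
    feature_values = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values, alpha


def predict_proba(model, row):
    """Return {label: posterior probability} for a single incident."""
    class_counts, value_counts, feature_values, alpha = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        log_p = math.log(n / total)  # log prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            k = len(feature_values[i])  # distinct values seen for feature i
            log_p += math.log((count + alpha) / (n + alpha * k))
        log_scores[label] = log_p
    # Normalise in a numerically stable way.
    top = max(log_scores.values())
    weights = {lab: math.exp(s - top) for lab, s in log_scores.items()}
    z = sum(weights.values())
    return {lab: w / z for lab, w in weights.items()}


# Toy incidents: (damage band, time of day), invented for illustration.
rows = [("high", "night"), ("high", "day"), ("low", "day"),
        ("low", "day"), ("high", "night"), ("low", "night")]
labels = ["fraud", "fraud", "ok", "ok", "fraud", "ok"]

model = train_naive_bayes(rows, labels)
proba = predict_proba(model, ("high", "night"))
print(proba)
```

The output is a probability per class, so it can be combined with a cost-based cut-off instead of a hard rule on a single variable. A library implementation would normally be preferable for real data; this version only shows the mechanics.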

Fato39
  • Thanks for your response, and sorry if my question is not clear. Simply, I need to select the accidents that have damage values greater than **(threshold)** and send them to the anti-fraud team to complete the investigation. So how can I find the best threshold that would indicate suspicious cases? – Analyst Sep 09 '20 at 09:19