3

I have a dataset of up to one million transactions in an Oracle database, and it's a data-warehouse-based system. I need to build an ATM card fraud detection model from the available data. Is it possible to do this in real time?

I would really appreciate it if some links to research papers/case studies could be provided.

Clarification: I had also visited the link provided below as a duplicate before posting this question. To make the question clear, I have 30+ fields including the ATM number, withdrawal amount, date, time, balance, ... More importantly, my database is not OLTP; it is for OLAP analysis [data warehouse]. I have also gone through various anomaly detection approaches.

Gala
  • 8,323
  • 2
  • 28
  • 42
Jivan
  • 131
  • 4
  • Possible duplicate of http://stats.stackexchange.com/questions/6949/any-good-reference-books-material-to-help-me-build-a-txn-level-fraud-detection-m/9525#9525 or http://stats.stackexchange.com/questions/25460/how-to-apply-clustering-analysis-to-help-identify-criminal-entities-out-from-cre/25462#25462 if the question stays at this very general level. We need some more information if you want more specific help. What variables do you have for each transaction in the warehouse? Are those variables relevant to any information you have on the characteristics of frauds? etc. – Peter Ellis Jul 22 '13 at 06:27
  • Thanks for the addition. I'm presuming that although the data's been warehoused you can still turn it into a rectangle with one row per transaction. Do you have any information on historical transactions that have turned out to be frauds which could be used as the response variable in a logistic regression? – Peter Ellis Jul 22 '13 at 06:58
  • You mean real-time fraud detection is possible...? Yes, the bank has labeled each transaction as fraud or non-fraud. @PeterEllis – Jivan Jul 22 '13 at 07:02
  • Do you have a response? Binary 0/1, that is Not fraud/Fraud? – Eric Paulsson Jul 22 '13 at 07:08
  • Yeah, for our available dataset it is labeled binary - 0/1. But the classification should be probabilistic, I think. – Jivan Jul 22 '13 at 07:11

2 Answers

1

You can fit a logistic regression model to the historical data for which you know whether each transaction is fraud or not. Then, as new data comes in, all you need to do is plug the values of each new transaction into your model and it will give you the probability (strictly, the log-odds, which is easily converted into a probability) that it is a fraud.
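As a rough illustration of that workflow, here is a minimal sketch using scikit-learn, assuming the labelled transactions have already been exported from the warehouse; the file name and column names ("amount", "hour", "balance", "is_fraud") are hypothetical placeholders for whichever of your 30+ fields you choose as features and the 0/1 fraud label.

```python
# Minimal sketch: fit logistic regression on labelled historical transactions,
# then score new transactions with a fraud probability.
# File and column names below are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("historical_transactions.csv")   # hypothetical export from the warehouse
features = ["amount", "hour", "balance"]           # placeholder feature set
X, y = df[features], df["is_fraud"]                # y is the 0/1 fraud label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# For a new transaction, predict_proba returns P(fraud); internally the model
# works on the log-odds, which the logistic function maps into [0, 1].
new_txn = X_test.iloc[[0]]
fraud_probability = model.predict_proba(new_txn)[0, 1]
print(f"Estimated fraud probability: {fraud_probability:.3f}")
```

Scoring a single transaction this way is fast, which is what makes a real-time setup plausible once the model has been trained offline on the warehouse data.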

Whether this is effective will depend on the cost of a false positive compared to a false negative, and on how well the model picks up what makes a fraud (e.g. if all frauds are at 3:15 pm for $11,000 you're in luck, but if they vary in their characteristics they will be harder to pick up).
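To make that cost trade-off concrete, here is a hedged sketch of choosing a decision threshold that minimises expected cost on held-out data. The cost figures and the random stand-in predictions are invented for illustration; in practice you would plug in your own cost estimates and the probabilities from your fitted model.

```python
# Sketch: pick the probability threshold minimising expected cost.
# Costs and the stand-in data below are invented for illustration only.
import numpy as np

COST_FALSE_POSITIVE = 10    # hypothetical cost of investigating a legitimate transaction
COST_FALSE_NEGATIVE = 500   # hypothetical average loss from a missed fraud

# Stand-ins for the model's predicted fraud probabilities and the true labels
# on a held-out set; replace these with your own arrays.
rng = np.random.default_rng(0)
probs = rng.random(1000)
labels = (rng.random(1000) < 0.02).astype(int)

def expected_cost(threshold):
    flagged = probs >= threshold
    false_pos = np.sum(flagged & (labels == 0))
    false_neg = np.sum(~flagged & (labels == 1))
    return false_pos * COST_FALSE_POSITIVE + false_neg * COST_FALSE_NEGATIVE

thresholds = np.linspace(0.01, 0.99, 99)
best = min(thresholds, key=expected_cost)
print(f"Threshold minimising expected cost: {best:.2f}")
```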

Peter Ellis
  • 16,522
  • 1
  • 44
  • 82
  • What could be the possible features for this? I've extracted about 10 features and am trying to use an SVM; will this really help? – Jivan Jul 22 '13 at 08:11
0

You have a response variable and a bunch of what may be explanatory variables. This makes classification possible (supervised learning techniques). Statistical modeling, for example logistic regression, is possible as well.

There are a bunch of different techniques, and they all have their own strengths and weaknesses. You said you want the classification to be probabilistic, though I'm not quite sure what you mean by that. For example, the final output of a neural network can be interpreted as a probability because of the sigmoid function on the output unit.
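For instance, a small numeric illustration of how a sigmoid output unit maps a real-valued score (the log-odds) into something that can be read as a probability in (0, 1):

```python
# Illustration only: the logistic (sigmoid) function squashes any real score
# into (0, 1), which is why such outputs are often read as probabilities.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for score in (-4.0, 0.0, 2.5):
    print(f"score {score:+.1f} -> probability {sigmoid(score):.3f}")
# prints roughly 0.018, 0.500 and 0.924
```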

In a fraud detection context, the response variable tends to be very skewed: there are usually very few positive observations (the fraudulent transactions). Classification techniques tend to minimize the error between observed and predicted values, and they do not care that it may be considered more important to find the fraudsters, as long as the total error rate is reasonably low. This is why you probably need to oversample the minority class. My advice would be to look into the SMOTE algorithm; it will help limit overfitting if used correctly.
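A hedged sketch of what that oversampling step might look like, assuming the imbalanced-learn package; synthetic data stands in for your transaction features, and only the training split should ever be resampled, never the test data.

```python
# Sketch: oversample the minority (fraud) class with SMOTE from imbalanced-learn.
# Synthetic data stands in for real transaction features (about 2% positives).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.98], random_state=42)
print("class counts before:", Counter(y))

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("class counts after: ", Counter(y_res))
```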

There are other ways to go as well; anomaly detection and techniques from that area might be worth looking into.
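One possible route in that direction is an Isolation Forest, which flags unusual transactions without using the fraud labels at all. A sketch on synthetic data, not a tuned detector:

```python
# Sketch: unsupervised anomaly scoring with an Isolation Forest.
# Random data stands in for real transaction features.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))          # stand-in for transaction features

iso = IsolationForest(contamination=0.01, random_state=42).fit(X)
scores = iso.decision_function(X)        # lower scores = more anomalous
flags = iso.predict(X)                   # -1 = anomaly, 1 = normal
print("flagged as anomalous:", np.sum(flags == -1))
```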

Eric Paulsson
  • 128
  • 2
  • 7