
I'm reading a research paper on fraud detection (imbalanced binary classification) in which the authors use synthetic data to evaluate their methods. I want to reproduce their synthetic data, but its description is not entirely clear to me. They define the generation process as follows:

A simulator is designed to generate the genuine data as well as that of fraud data. We design a simulator which can be described as follows:
(1) Total transactions for each cardholder: the total number of transactions for each cardholder is normally distributed with fixed μ (mean) = 275 and σ (standard deviation) = 20.
(2) Time: we generate the time of each transaction in 6 months (from January 2016 to June 2016). The transaction time is normally distributed in the 6 months.
(3) Amount: in order to distinguish different consume habits of different cardholders, we generate two kinds of consume distribution for 10 cardholders. For 5 cardholders, the amount is normally distributed with specified μ (mean) = 50 and σ (standard deviation) = 10. For other 5 ones, the amount is normally distributed with fixed μ (mean) = 1000 and σ (standard deviation) = 150.
(4) Location: we choose 7 cardholders who always consume in some fixed places and set their location-count as 1-3. And the other 3 cardholders always consume in different places with 10-15 locations.
We examine the performance on five datasets with different fraud rate which are 2%, 5%, 10%, 20%, 25% respectively.
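
For concreteness, here is my attempt so far at a literal reading of (1)–(4) in Python. The integer rounding of the transaction counts, the clipped normal for the timestamps, and the uniform draws over the location pools are my own guesses; the paper doesn't specify any of them.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLIENTS = 10
START = np.datetime64("2016-01-01")
SPAN_DAYS = int(np.datetime64("2016-07-01") - START)  # ~182 days

rows = []
for i in range(N_CLIENTS):
    # (1) transaction count: N(275, 20), rounded to an integer (my guess)
    n_i = max(1, int(round(rng.normal(275, 20))))

    # (2) "normally distributed in the 6 months": read here as a normal
    # centred on the window midpoint, clipped to the window (my guess)
    days = np.clip(rng.normal(SPAN_DAYS / 2, SPAN_DAYS / 6, n_i), 0, SPAN_DAYS - 1)

    # (3) two spending profiles: cardholders 0-4 small, 5-9 large
    mu, sigma = (50, 10) if i < 5 else (1000, 150)
    amounts = rng.normal(mu, sigma, n_i)

    # (4) cardholders 0-6 use 1-3 fixed places, 7-9 use 10-15 places
    n_locations = rng.integers(1, 4) if i < 7 else rng.integers(10, 16)
    locations = rng.integers(0, n_locations, n_i)

    for d, a, l in zip(days, amounts, locations):
        rows.append((i, START + np.timedelta64(int(d), "D"), float(a), int(l)))
```

As you can see, nothing in this loop produces a label.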

Now, I have the questions about this definition:

  • I can follow the feature generation, but where are the labels, actually? My best guess is (3), "two kinds of consume distribution for 10 cardholders", but I'm not sure that refers to different data (genuine vs. fraudulent) for the same cardholders; I'd rather read it simply as two groups of cardholders, each sharing a distribution among themselves. I think the point of generating synthetic training data would be to produce data that somehow differs with respect to the targets, and just generating some independent features doesn't guarantee this. I cannot see how the definition above guarantees that two subsets of the generated data differ from the analysis's perspective.
  • What is "specified μ" and "fixed μ" mentioned respectively in (3)?
Fredrik

1 Answer


Let's survey the features:

(1) Total number of transactions per client $i=1,...,10$, $n_i\sim N(275, 20)$.

(2) Timestamp $t_{ij}$ for each transaction (that's the $j^{th}$ transaction of client $i$). Why is it normally distributed over 6 months? Dunno, I'd go with uniform but that's not my paper.

(3) Amount of transaction $a_{ij}$: how much money did client $i$ spend on their $j^{th}$ transaction? Here, they create two profiles of clients: for the little fish, $a_{ij}\sim N(50,10)$; for the big spenders, $a_{ij}\sim N(1000,150)$. This has nothing to do with fraud (yet). There are 5 clients of each profile.

(4) Location $l_{ij}$ for each transaction. No explanations needed.

Overall, they generate about 2,750 transactions per dataset (10 cardholders × ~275 transactions each) and label a portion of the samples as fraudulent. This portion differs for each set: 2%, 5%, 10%, 20% or 25%.
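
If that reading is right, the labelling itself is just a random draw over the generated transactions, nothing feature-based. A minimal sketch of what I mean (here `df` is a DataFrame of the generated transactions, e.g. `pd.DataFrame(rows, columns=["cardholder", "time", "amount", "location"])` built from the simulator sketch in your question):

```python
import numpy as np
import pandas as pd

def label_dataset(df: pd.DataFrame, fraud_rate: float, seed: int = 0) -> pd.DataFrame:
    """Randomly mark a fraud_rate fraction of the transactions as fraud (label 1)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["label"] = 0
    n_fraud = int(round(fraud_rate * len(out)))
    fraud_idx = rng.choice(out.index.to_numpy(), size=n_fraud, replace=False)
    out.loc[fraud_idx, "label"] = 1
    return out

# one labelled dataset per fraud rate, as in the paper
datasets = {rate: label_dataset(df, rate) for rate in (0.02, 0.05, 0.10, 0.20, 0.25)}
```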

From here, I can only assume they split each dataset into train and test sets (I can't see another reasonable way to build the binary classifiers), and that the results shown in Table 1 are for the respective training sets using different thresholds.
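
The split proportion isn't given either; a stratified split, which preserves the fraud rate in both halves, would be my guess (the 70/30 ratio below is arbitrary):

```python
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    datasets[0.05],                    # e.g. the 5% fraud-rate dataset
    test_size=0.3,                     # proportion is my assumption
    stratify=datasets[0.05]["label"],  # keep the fraud rate equal in both halves
    random_state=0,
)
```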

Spätzle
  • But if they generate relatively homogeneous data and just split it randomly, in whatever proportions, into two subsets and assign 0/1 to them, what is the point? One must ensure that the separation of the data is along some separating feature, right? One must inject "fraudulency" somehow into the fraudulent subset. – Fredrik Aug 16 '21 at 08:41
  • They do not base their detection on the obvious, well-known features (extreme amounts, distant locations, etc.). Their target is to find the needle in the haystack, which is why they try to detect fraud inside what looks like homogeneous data. Their method relies on very fine details; it's all written in the paper. If you think about it, fraud can happen to anyone, and the thieves are getting smarter by the day. This setup is consistent with their ideas. – Spätzle Aug 16 '21 at 10:36
  • I understand the actual feature-engineering process, no problem with that. I just want to reproduce their results, which are based on the synthetic data in question. It is a supervised task (the evaluation metrics reflect this as well), so one must predefine what is fraudulent and what is normal. If it were an unsupervised process I'd accept your view, but in this case there must be some further explanation for the lack of labels. – Fredrik Aug 16 '21 at 13:48