I'm reading a research paper about fraud detection (imbalanced binary classification) in which the authors use synthetic data to evaluate their methods. I want to reproduce their synthetic data, but its description is not entirely clear to me. They define the generation process as follows:
> A simulator is designed to generate the genuine data as well as that of fraud data. We design a simulator which can be described as follows:
>
> (1) Total transactions for each cardholder: the total number of transactions for each cardholder is normally distributed with fixed μ (mean) = 275 and σ (standard deviation) = 20.
>
> (2) Time: we generate the time of each transaction in 6 months (from January 2016 to June 2016). The transaction time is normally distributed in the 6 months.
>
> (3) Amount: in order to distinguish different consume habits of different cardholders, we generate two kinds of consume distribution for 10 cardholders. For 5 cardholders, the amount is normally distributed with specified μ (mean) = 50 and σ (standard deviation) = 10. For other 5 ones, the amount is normally distributed with fixed μ (mean) = 1000 and σ (standard deviation) = 150.
>
> (4) Location: we choose 7 cardholders who always consume in some fixed places and set their location-count as 1-3. And the other 3 cardholders always consume in different places with 10-15 locations.
>
> We examine the performance on five datasets with different fraud rate which are 2%, 5%, 10%, 20%, 25% respectively.
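For reference, here is a minimal sketch (Python/NumPy/pandas) of how I would reproduce the feature generation as I currently read it. The centre and scale of the "normally distributed" transaction times, the rounding of the per-cardholder transaction counts, which cardholders fall into which amount/location group, and the integer encoding of locations are all my own assumptions, not taken from the paper; I have left labels out entirely, since that is exactly the part I don't understand.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

N_CARDHOLDERS = 10
START = pd.Timestamp("2016-01-01")
END = pd.Timestamp("2016-07-01")
SPAN_SECONDS = (END - START).total_seconds()

rows = []
for holder in range(N_CARDHOLDERS):
    # (1) number of transactions ~ N(275, 20), rounded to an integer (my assumption)
    n_tx = max(1, int(rng.normal(275, 20)))

    # (2) transaction times "normally distributed" over the 6 months;
    #     I assume a normal centred on the middle of the period, clipped to the period
    t = rng.normal(loc=SPAN_SECONDS / 2, scale=SPAN_SECONDS / 6, size=n_tx)
    t = np.clip(t, 0, SPAN_SECONDS - 1)
    times = START + pd.to_timedelta(t, unit="s")

    # (3) two spending profiles: first 5 cardholders ~ N(50, 10), last 5 ~ N(1000, 150)
    #     (the paper does not say which cardholders belong to which group)
    if holder < 5:
        amounts = rng.normal(50, 10, size=n_tx)
    else:
        amounts = rng.normal(1000, 150, size=n_tx)

    # (4) 7 cardholders use 1-3 fixed locations, the other 3 use 10-15 locations;
    #     again the assignment to groups is my assumption
    n_locations = rng.integers(1, 4) if holder < 7 else rng.integers(10, 16)
    locations = rng.integers(0, n_locations, size=n_tx)

    for time, amount, loc in zip(times, amounts, locations):
        rows.append({"cardholder": holder, "time": time,
                     "amount": amount, "location": loc})

df = pd.DataFrame(rows)
print(df.head())
# NOTE: there is no label column here -- the paper never says how fraud labels
# are assigned, which is the core of my first question below.
```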
Now I have the following questions about this definition:
- I can follow the feature generation, but where do the labels actually come from? My best guess is that they are hidden in (3), in "two kinds of consume distribution for 10 cardholders", but I'm not sure this is meant as authentic versus fraudulent data for the same cardholders. I would rather read it simply as two groups of cardholders, each with the same data distribution within its group. The point of generating synthetic training data should be to produce data that differs in some way with respect to the targets, and merely generating label-independent features does not guarantee this. I cannot see how the definition above guarantees that the two subsets of the generated data differ from the analysis's perspective (see the snippet after these questions for the kind of label-dependent difference I would have expected).
- What is "specified μ" and "fixed μ" mentioned respectively in (3)?