
Let's say we're doing fraud detection, where each transaction has one of two labels:

  1. Fraud
  2. Non-fraud

In real-world scenarios we usually get far more non-fraud examples than fraud examples. Let's assume the non-fraud:fraud ratio is 80:20. My question is: even if I build a classifier, won't my model just predict the majority label, since the data itself is not well distributed? What should the approach be in such scenarios?

  • There are a lot of posts on this site about unbalanced classes; search the site for them! In particular: https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression since logistic regression could be a good starting method for your problem. – kjetil b halvorsen Jul 31 '17 at 11:17
  • The SMOTE algorithm is a popular choice. See e.g. here https://www.jair.org/media/953/live-953-2037-jair.pdf and here https://stats.stackexchange.com/questions/234016/opinions-about-oversampling-in-general-and-the-smote-algorithm-in-particular – tosik Aug 03 '17 at 19:51
  • Avoid the problem by using a probabilistic model, like logistic regression? – kjetil b halvorsen Aug 03 '17 at 20:56

3 Answers


Two approaches I've read about are down-sampling the majority class, and changing the misclassification cost so that missing the minority class (fraud, in this case) is penalized more severely.
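A minimal sketch of the cost-based approach, assuming scikit-learn's class_weight parameter and synthetic data in the 80:20 ratio from the question (the model and features are illustrative, not part of the original answer):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Synthetic 80:20 data mimicking the non-fraud:fraud ratio in the question.
    X, y = make_classification(n_samples=10000, weights=[0.8, 0.2], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # class_weight='balanced' scales the loss inversely to class frequency,
    # so misclassifying a fraud case costs more than a non-fraud case.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_tr, y_tr)

    # Inspect per-class precision/recall rather than plain accuracy.
    print(classification_report(y_te, clf.predict(X_te)))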

Zhenya

One option is to down-sample your majority class, i.e. randomly throw away events belonging to the majority class. That is of course feasible only if you have enough data to afford it.
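A minimal down-sampling sketch, assuming imbalanced-learn's RandomUnderSampler (plain random indexing would do the same job):

    from collections import Counter
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=10000, weights=[0.8, 0.2], random_state=0)
    print("before:", Counter(y))

    # Randomly drop majority-class rows until both classes have equal counts.
    X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
    print("after:", Counter(y_res))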

An alternative is to up-sample the minority class; a popular algorithm for generating synthetic data for the minority class is SMOTE (https://www.jair.org/media/953/live-953-2037-jair.pdf). I found a Python implementation here: https://github.com/scikit-learn-contrib/imbalanced-learn, although I was only able to use it successfully with Support Vector Machines as the classifier; it did not make much of a difference with other classifiers.
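For illustration, a minimal sketch of that combination, assuming imbalanced-learn's SMOTE implementation and an SVM, with synthetic data standing in for a real fraud set:

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=0)

    # SMOTE interpolates between minority-class nearest neighbours to create
    # new synthetic samples; the pipeline applies it to training folds only,
    # so synthetic points never leak into the evaluation folds.
    pipe = make_pipeline(SMOTE(random_state=0), SVC())
    print(cross_val_score(pipe, X, y, scoring="f1", cv=5))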

famargar

Another option, used e.g. in the R mixOmics package, is to base fitness measures during model parameter tuning not only on the overall misclassification rate but also on per-class misclassification rates. This allows customised tuning and automatic balancing across classes.
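mixOmics is an R package; as a rough Python analogue (an assumption, not mixOmics's own API), one can tune hyperparameters against balanced accuracy, i.e. the mean of the per-class recalls, with scikit-learn:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=0)

    # 'balanced_accuracy' averages recall over the classes, so a model that
    # always predicts the majority class scores only 0.5 during tuning.
    grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]},
                        scoring="balanced_accuracy", cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)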

CarlBrunius