I'm trying to build a classifier for my highly imbalanced binary data, and I'd appreciate some help on how to balance my results. The dataset has the following stats:

tabulate(classes)
  Value    Count   Percent
      0    133412     97.62%
      1     3247      2.38%

My dataset has 113 features. I'm using a boosting ensemble classifier with the RUSBoost algorithm (as my dataset is highly imbalanced). My weak learners are decision trees with a maximum of 5125 splits (1/16 of my training dataset examples). I'm using 300 learning cycles and a learn rate of 0.1. I get the following results (with 60% training and 40% testing):

accuracy: 0.99398
sensitivity: 0.87596
specificity: 0.99685
PPV: 0.87126
NPV: 0.99698

When plotting the ROC curve for my classifier (using test data), I get the following:

[Figure: ROC curve of the classifier on the test data]

As can be seen, the classifier achieves very high specificity (and NPV), but noticeably lower sensitivity (and PPV). Hence, my question is:

How can I change my classifier in order to get a balanced sensitivity and specificity (and of course PPV and NPV)? For example, the values indicated in the ROC curve would be awesome.

Any suggestions are much appreciated!

DiogoT

1 Answer

This is a relevant question for many application domains.

You can adjust the posterior probabilities $P(C \mid {\bf x})$ and $P(\neg C \mid {\bf x})$ by recalculating for a different prior distribution $P^\prime(C)$ and $P^\prime(\neg C)$. The correction formula is derived here.
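
For reference, a commonly used form of this correction is sketched below. It assumes that only the class priors change while the class-conditional densities $p({\bf x} \mid C)$ and $p({\bf x} \mid \neg C)$ stay the same. Here $P(C)$ denotes the prior implicit in the classifier's outputs; with an undersampling method such as RUSBoost this may be closer to balanced than the raw 2.38%, which is an assumption worth checking.

$$
P^\prime(C \mid {\bf x}) = \frac{\dfrac{P^\prime(C)}{P(C)}\, P(C \mid {\bf x})}{\dfrac{P^\prime(C)}{P(C)}\, P(C \mid {\bf x}) + \dfrac{P^\prime(\neg C)}{P(\neg C)}\, P(\neg C \mid {\bf x})}
$$

with $P^\prime(\neg C \mid {\bf x}) = 1 - P^\prime(C \mid {\bf x})$. Setting $P^\prime(C) = P(C)$ returns the original posterior unchanged.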

Choose $P^\prime(C)$ and $P^\prime(\neg C)$ so as to get as close as possible to the desired sensitivity and specificity (or PPV and NPV). Your desired optimum may not be achievable, but you can find the closest achievable solution.
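
As a rough illustration of this search, here is a minimal Python sketch (the question appears to use MATLAB, but the arithmetic carries over directly). The function names `adjust_posterior`, `sens_spec`, and `best_balanced_prior`, the grid of candidate priors, and the fixed 0.5 cutoff are all assumptions made for this example and are not part of any library. It assumes you already have the classifier's posterior estimates for the positive class on a validation set, and that both classes are present in it.

```python
def adjust_posterior(p_pos, old_prior, new_prior):
    """Re-weight P(C|x) from the prior P(C)=old_prior to P'(C)=new_prior.

    Assumes the class-conditional densities are unchanged, so each class
    posterior is scaled by the ratio new prior / old prior and renormalised.
    """
    w_pos = new_prior / old_prior
    w_neg = (1.0 - new_prior) / (1.0 - old_prior)
    num = w_pos * p_pos
    return num / (num + w_neg * (1.0 - p_pos))


def sens_spec(posteriors, labels, threshold=0.5):
    """Sensitivity and specificity when predicting 1 for posterior >= threshold."""
    tp = fn = tn = fp = 0
    for p, y in zip(posteriors, labels):
        predicted_pos = p >= threshold
        if y == 1:
            tp += predicted_pos
            fn += not predicted_pos
        else:
            fp += predicted_pos
            tn += not predicted_pos
    return tp / (tp + fn), tn / (tn + fp)


def best_balanced_prior(posteriors, labels, old_prior, grid_size=99):
    """Grid-search P'(C) for the smallest |sensitivity - specificity|.

    The balance criterion is only one possible target (hypothetical choice);
    swap in any other distance to your desired operating point.
    """
    best = None
    for i in range(1, grid_size + 1):
        new_prior = i / (grid_size + 1)  # candidates strictly between 0 and 1
        adjusted = [adjust_posterior(p, old_prior, new_prior) for p in posteriors]
        sens, spec = sens_spec(adjusted, labels)
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, new_prior, sens, spec)
    return best[1], best[2], best[3]
```

Because the correction is monotone in $P(C \mid {\bf x})$, sweeping $P^\prime(C)$ at a fixed 0.5 cutoff visits the same operating points as moving the decision threshold along the ROC curve, so the result can be no better than the best point on that curve. The prior should be selected on validation data held out from the final test set.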

Match Maker EE