
I am working on a classifier which stratifies a population of samples into different classes.

The class distribution (ground truth) is imbalanced, and the prevalence of each class is:

$$\begin{array}{cc}\text{Label} & \text{Prevalence}\\ C_1 & 0.14\\ C_2 & 0.17\\ C_3 & 0.26\\ C_4 & 0.43\end{array}$$
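For context, the per-class weights mentioned below can be derived from these prevalences. A minimal sketch, assuming inverse-prevalence weighting (the `prevalence` dict simply mirrors the table above; the normalisation choice is an assumption, not the exact scheme used):

```python
# Hypothetical prevalences copied from the table above.
prevalence = {"C1": 0.14, "C2": 0.17, "C3": 0.26, "C4": 0.43}

# One common choice: weight each class inversely to its prevalence,
# then normalise so the weights average to 1 across classes.
raw = {c: 1.0 / p for c, p in prevalence.items()}
total = sum(raw.values())
class_weight = {c: len(raw) * w / total for c, w in raw.items()}

print(class_weight)  # the minority class C1 gets the largest weight
```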

The classifier is based on Random Forest.

At the moment my pipeline is the following:

  1. Feature selection on the dataset - in this case I am testing:
  2. On the feature-selected dataset, an exhaustive search of the Random Forest parameters (number of trees and minimum number of samples required for a split) using GridSearchCV, in particular:
    • a 3-fold CV classification for each parameter set, where each class is weighted according to its prevalence
    • each 3-fold CV classification is evaluated with a macro-averaged F1-score (so that every class carries the same importance, independently of prevalence)
    • the distribution of scores is inspected with boxplots to choose the optimal parameters
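The grid-search step above can be sketched as follows. This is a minimal sketch assuming scikit-learn; the synthetic data, the parameter grid values, and `class_weight="balanced"` are placeholders standing in for the actual feature-selected dataset and settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the feature-selected dataset,
# with class proportions matching the prevalences in the question.
X, y = make_classification(
    n_samples=600, n_classes=4, n_informative=8,
    weights=[0.14, 0.17, 0.26, 0.43], random_state=0,
)

# Class weighting is handled inside the estimator; macro-F1 scoring
# gives every class equal importance regardless of prevalence.
grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={
        "n_estimators": [50, 100],      # number of trees
        "min_samples_split": [2, 10],   # min samples required for a split
    },
    scoring="f1_macro",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

`cv_results_` on the fitted grid exposes the per-fold scores that the boxplot step would summarise.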

However, this pipeline only improves the overall accuracy and the metrics of the larger classes; the minority class gains specificity, but not enough sensitivity.
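One way to make this per-class behaviour visible is to compute the recall (sensitivity) of each class separately. A sketch assuming scikit-learn, with hypothetical `y_true`/`y_pred` arrays for illustration:

```python
from sklearn.metrics import recall_score

# Hypothetical ground truth and predictions for a 4-class problem,
# where the minority class (label 0) is often mistaken for the
# majority class (label 3).
y_true = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
y_pred = [0, 3, 1, 3, 2, 2, 3, 3, 3, 3, 3, 3]

# average=None returns the recall of each class separately, instead
# of a single macro/micro average.
per_class_recall = recall_score(y_true, y_pred, average=None)
print(per_class_recall)  # low entries flag classes lacking sensitivity
```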

Is there a way to approach the problem in order to increase the sensitivity of the minority class?

gc5
  • Just a guess: from my understanding, macro-averaged-F1 scores might not be the best choice for your goal. You could try to replace its internal precision/recall with e.g. sensitivity/specificity or AUC of the ROC curve if possible (but can't guarantee that this will help). – geekoverdose Jul 08 '16 at 10:01
  • 1
    This is the perfect answer for your question: http://stats.stackexchange.com/a/158030/78313 – Metariat Jul 08 '16 at 10:30

0 Answers