
I have an imbalanced dataset, with the following stats:

  Value     Count   Percent
      0    133412    97.62%
      1      3247     2.38%

I have trained a classifier (a boosting ensemble built with the RUSBoost algorithm, with 300 learning cycles and a learn rate of 0.1; the weak learners are decision trees with a maximum of 1300 splits) using a 50% holdout validation split [1].

By applying the resulting classifier to the test dataset (and comparing the predicted classes with the real classes), I get the following results [2]:

Sensitivity: 86.7%
Specificity: 99.8%
NPV: 99.9%
PPV: 86.4%

Overall, on my test dataset I get the following ROC curve (modified so that specificity, rather than 1 - specificity, is on the x-axis) [3]:

[Image: ROC curve of sensitivity vs. specificity, with the classifier's operating point marked as a red point]

The red point indicates the classifier's overall results (described previously). I now want to rebalance those results and make my model more sensitive (e.g., 85% specificity and 99% sensitivity). I have tried changing the prior probabilities and the costs during training, but it doesn't affect my test results. I'm now trying to use the prediction scores to achieve the higher sensitivity. Hence, my question is:

How can I adjust my scores so that my classifier performs with higher sensitivity (and lower specificity)? And, in terms of validation, where should that adjustment be made: on the training set or on the test set? Recommendations are appreciated!
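To make the idea concrete, this is the kind of adjustment I have in mind (just a sketch; I'm assuming that scores(:,2) from predict in [2] is the score for class 1, matching the order of ensemble_model.ClassNames):

% Instead of taking the classes from predict, apply my own cutoff to the
% class-1 scores. The 0.5 here is only a placeholder; choosing the right
% value (and where to choose it) is exactly what I'm asking about.
my_threshold = 0.5;
adjusted_classes = double(scores(:,2) >= my_threshold);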

Many thanks, DT


Source code in MATLAB:

[1]

% Create template tree
template_tree = templateTree('MergeLeaves', 'off', 'MaxNumSplits', 1300, ...
    'NumVariablesToSample', 'all', 'Prune', 'off');

% Create Ensemble model
ensemble_model = fitcensemble(holdout_train_features, holdout_train_classes, ...
    'Method', 'RUSBoost', 'NumLearningCycles', 300, ...
    'Learners', template_tree, 'LearnRate', 0.1);
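
For reference, this is roughly how I changed the priors and costs during training (the exact values here are illustrative, not the ones I actually used):

% Variant of the fit above with uniform priors and a misclassification
% cost matrix (rows = true class, columns = predicted class) that
% penalises false negatives more heavily.
cost_matrix = [0 1; 10 0];
ensemble_model_cost = fitcensemble(holdout_train_features, holdout_train_classes, ...
    'Method', 'RUSBoost', 'NumLearningCycles', 300, ...
    'Learners', template_tree, 'LearnRate', 0.1, ...
    'Prior', 'uniform', 'Cost', cost_matrix);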

[2]

%% Perform validation on test dataset
% Apply model to test dataset
[obtained_classes, scores] = predict(ensemble_model, holdout_test_features);

% Compare obtained classes with the real classes using the confusion matrix   
holdout_validation_results = confusionchart(holdout_test_classes, obtained_classes);
TN = holdout_validation_results.NormalizedValues(1,1);   % rows = true class, columns = predicted class; order is [0; 1]
TP = holdout_validation_results.NormalizedValues(2,2);
FP = holdout_validation_results.NormalizedValues(1,2);
FN = holdout_validation_results.NormalizedValues(2,1);
accuracy = (TP + TN)/(TP + TN + FP + FN);
sensitivity = TP/(TP + FN);    % true positive rate
specificity = TN/(TN + FP);    % true negative rate
PPV = TP/(TP + FP);            % positive predictive value
NPV = TN/(TN + FN);            % negative predictive value

At this stage, I honestly don't know how MATLAB uses the scores inside `predict` to get the predicted classes. I do know that the ensemble fitting function has a 'ScoreTransform' input.
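
My working assumption (which I couldn't confirm in the documentation) is that predict simply returns the class with the highest score; this is how I would check that:

% Sanity check: does predict just pick the column with the maximum score?
[~, max_col] = max(scores, [], 2);
assumed_classes = ensemble_model.ClassNames(max_col);
isequal(assumed_classes, obtained_classes)   % true if the assumption holds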

[3]

% Compute the ROC curve using the prediction scores for class 1
[X, Y, T, AUC, OPTROCPT] = perfcurve(holdout_test_classes, scores(:,2), 1);
plot(1 - X, Y)   % modified ROC: specificity (1 - FPR) on the x-axis
hold on          % the red operating point from [2] is added here (code not shown)
xlabel('Specificity')
ylabel('Sensitivity')
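
If thresholding the scores is the right approach, I assume I could pick the operating point directly from the perfcurve outputs, something like this (the target value is illustrative):

% Take the first ROC point whose sensitivity (Y) reaches the target and
% use the corresponding perfcurve threshold to re-derive the classes.
target_sensitivity = 0.99;
idx = find(Y >= target_sensitivity, 1);
threshold = T(idx);
adjusted_classes = double(scores(:,2) >= threshold);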
  • **1)** ROC curves usually have 1-specificity on the x-axis. **2)** What do you mean that you have a result of 86.5% sensitivity and 99.7% specificity? You must be selecting some threshold in order to turn your probabilistic predictions into categories...so pick a different threshold. **3)** Obligatory comment about proper scoring rules as opposed to threshold-based metrics: https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email (Kolassa's answer is good, and I like his link and the links contained in that link). – Dave Aug 10 '20 at 22:15
  • @Dave thanks for your helpful reply. Based on what you sent me, I have improved my question with more information. I honestly don't know what's behind the `predict` function and how it outputs the predicted classes - I don't know if it uses a threshold selected during training or if it just returns the class with the higher score. I couldn't find the answer in MATLAB's documentation, sadly. I guess my question now is: how do I apply the scoring rules? And can I use them to make the model more sensitive? – DiogoT Aug 11 '20 at 08:56
  • What I understood is the following: in order to get my classification I'll always have to pick a threshold to make the decision. However, applying scoring rules makes picking that threshold easier. Is that right? – DiogoT Aug 11 '20 at 12:40
