I am working on a binary classification problem on an imbalanced dataset where the majority class ('no') makes up about 90% of the data and the minority class ('yes') about 10%.
Iteration 1: I randomly split my data (about 1200 rows) into 70% training and 30% test sets and trained a random forest classifier to get the probability of each of the two classes. I wanted to see how the model's performance on 'yes' vs. 'no' changes with the cutoff probability used to decide between the majority and minority class in the test data. So I started with a cutoff probability of 0.01, got the 'yes'/'no' predictions and the corresponding AUC, sensitivity and specificity. I then incremented the cutoff probability to 0.02 and again noted the AUC, sensitivity and specificity. I repeated this, each time incrementing the probability by 0.01, until I reached a cutoff probability of 1.00. This completed the first iteration.
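For concreteness, here is a minimal sketch of what one such iteration looks like in scikit-learn. The `make_classification` call is only a synthetic stand-in for my actual data, and `one_iteration` is a hypothetical helper name; note that the AUC here is computed from the thresholded hard labels, which is the only way it can vary with the cutoff (AUC on the raw probabilities is threshold-free):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix

# Synthetic stand-in for the actual data: ~1200 rows, ~90% 'no' (0), ~10% 'yes' (1).
X, y = make_classification(n_samples=1200, weights=[0.9], random_state=0)

def one_iteration(X, y, seed):
    """One 70/30 split; sweep the cutoff from 0.01 to 1.00 in steps of 0.01."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    p_yes = clf.predict_proba(X_te)[:, 1]          # P('yes') for each test row
    rows = []
    for cutoff in np.linspace(0.01, 1.00, 100):
        y_hat = (p_yes >= cutoff).astype(int)      # thresholded 'yes'/'no' prediction
        tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
        sens = tp / (tp + fn)                      # sensitivity (recall on 'yes')
        spec = tn / (tn + fp)                      # specificity (recall on 'no')
        auc = roc_auc_score(y_te, y_hat)           # AUC of the hard labels
        rows.append((round(cutoff, 2), auc, sens, spec))
    return rows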
Iterations 2 to 1000: I repeated the above experiment 1000 times, taking a random 70% training / 30% test split each time.
Finally, for each cutoff probability between 0.01 and 1.00, I calculated the average values of the AUC, sensitivity and specificity and plotted them on the graph below.
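The repetition and per-cutoff averaging, assuming the `one_iteration()` helper and the placeholder `X, y` from the sketch above, would look roughly like this:

```python
# Repeat the experiment over 1000 random 70/30 splits and average per cutoff.
n_iter = 1000
totals = np.zeros((100, 3))        # per-cutoff running sums of (auc, sens, spec)
for seed in range(n_iter):
    totals += np.array([r[1:] for r in one_iteration(X, y, seed)])
avg = totals / n_iter              # row i holds the averages at cutoff (i + 1) / 100
# avg[:, 0], avg[:, 1], avg[:, 2] are the curves plotted against the cutoffs.
```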
Since the data is imbalanced, the sensitivity is high, lying between 0.95 and 0.99, as shown by the orange line. The AUC is highest when the cutoff probability is about 0.11, whereas the specificity is highest when the cutoff probability is about 0.8. I observed a similar pattern in the graphs for other classifiers such as XGBoost, GBM, AdaBoost, etc. Each of these algorithms has its own distinct characteristic curves, but the overall pattern remains the same.
Question 1: The AUC is highest when the cutoff probability is about 0.11, whereas the specificity is highest when the cutoff probability is about 0.8. Why does the optimal cutoff for AUC differ from that for specificity?
Question 2: Given the above information, if we are to manually choose a cutoff instead of letting the algorithm decide automatically, what cutoff should we use to select the best model? Assume that no further feature engineering or re-modeling will be done.