
I have been using GridSearchCV to tune the hyperparameters of three different models. Through hyperparameter tuning I have gotten AUCs of 0.65 (Model A), 0.74 (Model B), and 0.77 (Model C).

However, when I return `best_score_` for each grid search, I get scores of 0.72 (Model A), 0.68 (Model B), and 0.71 (Model C).

I am confused about why these scores are noticeably different; for example, Model A has the weakest AUC but the strongest `best_score_`. Is this OK? Does this mean that more tuning likely needs to be done?
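
In case it helps, here is a stripped-down sketch of the kind of workflow I mean (the parameter grid and the `X_train`/`X_test` names are placeholders, not my actual code):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

# Tune with the default scoring; the grid is only illustrative
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5]},
    cv=5,
)
grid.fit(X_train, y_train)

print(grid.best_score_)  # mean cross-validation score of the best parameter combination

# AUC computed on a separate test set
print(roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1]))
```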

Thanks!

Joe
  • How are you obtaining the first set of AUCs? What metric are you using for the grid search (i.e., what is `best_score_`?) What kinds of models are A, B, C? – Ben Reiniger Dec 07 '21 at 22:39
  • Agree with @Ben Reiniger - it would be beneficial to define what `best_score_` is. My guess is that it's the best score seen over all iterations for that model, and the AUC reported is probably an average. If you knew what `best_score_` was, you would probably know the answer. –  Dec 07 '21 at 22:48
  • I am obtaining the AUCs through a RandomForestClassifier. The models contain biomarkers to predict a disease. The `best_score_` is the average over all CV folds (mean cross-validation score). @BenReiniger – Joe Dec 07 '21 at 23:14
  • But are the AUCs from a separate test set, or...? Did you use the default `scoring`, or `"roc_auc"`? – Ben Reiniger Dec 07 '21 at 23:28
  • I used the default scoring...should I have used `"roc_auc"`? AUCs are from a separate test set. @BenReiniger – Joe Dec 07 '21 at 23:41

1 Answer


There are two main issues here, in my mind.

  1. You're comparing accuracy and AUROC. The default scoring in GridSearchCV uses the model object's `score` method, which is accuracy for classification models like RandomForestClassifier. There's no guarantee that two metrics agree on which model is best, and accuracy isn't a great metric anyway. One specific possibility here is that Model A does a poor job of rank-ordering compared to the others, while the others perform poorly at the default classification cutoff of 0.5 used by the accuracy metric. If you want the search to select hyperparameters by AUROC, pass `scoring="roc_auc"`; see the sketch after this list.

  2. You're comparing test-set performance with hyperparameter-selection scores. The `best_score_` is optimistically biased because of the selection process. If one of the selections resulted in a more-overfit model, it may show a larger drop from its CV score to its test score than the others.
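
To make point 1 concrete, here is a minimal sketch (using synthetic data and an illustrative grid, so not your actual setup) of selecting hyperparameters by AUROC and then checking the same metric on a held-out test set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the biomarker data; the grid is only illustrative.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    scoring="roc_auc",  # rank candidates by AUROC rather than accuracy
    cv=5,
)
grid.fit(X_train, y_train)

print("CV AUROC (best_score_):", grid.best_score_)

# Same metric on the untouched test set; expect it to sit somewhat below
# best_score_ because of the selection bias described in point 2.
test_auc = roc_auc_score(y_test, grid.best_estimator_.predict_proba(X_test)[:, 1])
print("Test AUROC:", test_auc)
```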

Ben Reiniger