
Let's say I have a very basic binary classification problem and I use logistic regression. The logistic regression gives me a score between 0 and 1 (not a classification yet).

I can use sklearn's roc_auc_score to calculate the ROC AUC easily by using roc_auc_score(y_train, predicted_scores). The function will find the best threshold for me.

However, if I want to check the ROC AUC for my validation set, can I just use roc_auc_score(y_val, predicted_val_scores)? It will then look for the best threshold again, right? Shouldn't I find a way to use the same threshold as in the first call? Or am I overthinking this?
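For concreteness, here is a minimal sketch of the setup I mean (the synthetic data and variable names are just for illustration):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Toy binary classification problem (illustrative only)
    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)

    # Scores in [0, 1]: predicted probability of the positive class
    predicted_scores = model.predict_proba(X_train)[:, 1]
    predicted_val_scores = model.predict_proba(X_val)[:, 1]

    print(roc_auc_score(y_train, predicted_scores))    # training AUC
    print(roc_auc_score(y_val, predicted_val_scores))  # validation AUC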


1 Answer


You write

I can use sklearn's roc_auc_score to calculate the ROC AUC easily by using roc_auc_score(y_train, predicted_scores). The function will find the best threshold for me.

but roc_auc_score doesn't find the best threshold; it just measures the area under the ROC curve. Computing the area under the ROC curve doesn't involve finding a single "best" threshold: the curve is traced out by sweeping over every possible cutoff, and the AUC integrates over all of them. See How to calculate Area Under the Curve (AUC), or the c-statistic, by hand for an explanation of how this works.
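You can see this directly with sklearn's roc_curve, which returns one point on the curve per candidate cutoff. A minimal sketch (the toy arrays are made up for illustration):

    import numpy as np
    from sklearn.metrics import auc, roc_auc_score, roc_curve

    y_true = np.array([0, 0, 1, 1, 0, 1])
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.7])

    # roc_curve sweeps every distinct score as a candidate cutoff;
    # no single "best" threshold is selected anywhere.
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    print(thresholds)  # one cutoff per point on the curve

    # The AUC is just the area under that (fpr, tpr) curve:
    print(auc(fpr, tpr))                  # trapezoidal area under the curve
    print(roc_auc_score(y_true, scores))  # identical value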

  • Thanks for your answer! And of course the area under the curve doesn't have anything to do with a single threshold, but to draw the ROC curve you need to specify several cut-offs (as is stated in the thread you linked). I mean, predicted_scores needs to be converted to a binary classification outcome to calculate the sensitivity/specificity/TPR etc., and apparently roc_auc_score does this for me, since I only specify the predicted probabilities and the true Y (which is binary) – user50466 May 04 '20 at 12:49
  • Yes, your comment is an accurate summary of how this function works. – Sycorax May 04 '20 at 12:56
  • Then again, when I run the roc_auc_score function, it converts the probabilities to a binary prediction and uses a cutoff for this. Shouldn't I use the cutoff roc_auc_score uses on the training data on the validation set as well? Right now I am just calling the function twice (once on the training set, once on the validation set), but I'm not sure if that is allowed, since it can use a different cutoff when called on the validation set, which does not seem fair. Hope this is clear? – user50466 May 04 '20 at 13:03
  • The cutoffs depend on the data alone, so you're fine. In other words, the AUC is a statistic based on ranks, and we can always rank probabilities without having to refer to another data set (see the sketch below). You're correct to worry about out-of-sample properties, but that's actually captured by estimating the error of the AUC. AUC is a statistic, so it has an associated error. – Sycorax May 04 '20 at 13:33
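To illustrate Sycorax's point that the AUC is a rank statistic, here is a minimal sketch (the toy arrays are made up for illustration) that computes the AUC from ranks alone, via its relationship to the Mann-Whitney U statistic, with no threshold in sight:

    import numpy as np
    from scipy.stats import rankdata
    from sklearn.metrics import roc_auc_score

    y = np.array([0, 0, 1, 1, 0, 1])
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.7])

    # Only the ordering of the scores matters: rank them, then count
    # (via the Mann-Whitney U statistic) how often a positive case
    # outranks a negative one. No cutoff and no second data set needed.
    ranks = rankdata(scores)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    u_stat = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    auc_from_ranks = u_stat / (n_pos * n_neg)

    print(auc_from_ranks)            # 0.777...
    print(roc_auc_score(y, scores))  # identical value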