I'm new to Machine Learning and Statistics, so pardon me if I say anything ridiculous.
By "test set" I mean the set that we evaluate the final hypothesis and then report the final result (e.g. test error) that is an unbiased estimate of the corresponding out-of-sample result (e.g. out-of-sample error).
By "validation set" I mean the set that we use to do model selection or parameter tuning to choose out the final hypothesis. The best result found on the validation set is biased by definition (if you evaluate only one hypothesis on the validation set, then the validation set is a test set).
I am sorry for the two lengthy paragraphs above, but I want to be sure that we are talking about the same thing. Now comes the main question:
Why do we want to calculate the ROC curve on the test set?
In many of the resources I have read, the ROC curve is calculated on either the training set or the test set without a clear definition of "test set", so pardon me if I read them wrong. However, I'm still curious: if the test set is defined as above, what is the point of calculating the ROC curve on it? Isn't the threshold choice made on the training set (which is perhaps heavily optimistically biased) or on the validation set (which might be less optimistically biased)? Doesn't the test set become a validation set if we make the threshold choice on it?
The procedure that sounds reasonable to me is to calculate the ROC curve on the validation set and use it for model selection / parameter tuning and threshold selection, then report the final result on the untouched test set.
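To make that procedure concrete, here is a minimal sketch of what I have in mind with scikit-learn (the synthetic dataset, the logistic regression model, and the "maximize TPR - FPR" threshold rule are just assumptions for illustration, not part of any resource I'm quoting):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, accuracy_score

# Synthetic data, just for illustration.
X, y = make_classification(n_samples=2000, random_state=0)

# Split into train / validation / test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# ROC curve on the VALIDATION set: this is where the threshold is chosen.
fpr, tpr, thresholds = roc_curve(y_val, model.predict_proba(X_val)[:, 1])
best_threshold = thresholds[np.argmax(tpr - fpr)]  # e.g. maximize Youden's J = TPR - FPR

# The TEST set is used only once, to report the final estimate at that fixed threshold.
y_pred_test = (model.predict_proba(X_test)[:, 1] >= best_threshold).astype(int)
print("test accuracy at chosen threshold:", accuracy_score(y_test, y_pred_test))
```

In this sketch the test set never influences the threshold, which is why I don't see what an ROC curve computed on it would be used for.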