
I've trained a binary classification model that outputs a "probability" in (0, 1).

During testing and validation, I use the ROC curve to measure the performance of the model. I also use the ROC curve to determine the threshold at which to cut off negative vs. positive predictions (e.g., I target an FPR under 15%), as in the sketch below.
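A minimal sketch of this kind of threshold selection, assuming a fitted scikit-learn classifier `model` and a held-out validation set `(X_val, y_val)`; these names are illustrative, not from the original post:

```python
from sklearn.metrics import roc_curve

# Predicted probabilities for the positive class on the validation set.
val_scores = model.predict_proba(X_val)[:, 1]

# ROC curve over the validation set.
fpr, tpr, thresholds = roc_curve(y_val, val_scores)

# Choose the lowest threshold whose FPR still stays under the 15% target
# (roc_curve returns thresholds in decreasing order, FPR non-decreasing).
target_fpr = 0.15
mask = fpr <= target_fpr
threshold = thresholds[mask][-1]
print(f"chosen threshold: {threshold:.3f} "
      f"(FPR={fpr[mask][-1]:.3f}, TPR={tpr[mask][-1]:.3f})")
```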

When creating a model for production, I thought it would be ideal to train on the entire available dataset (i.e., no test or validation split). But without a test or validation split, I don't have a ROC curve for the final model, so I have no threshold with which to interpret the model's output.

Is it valid to use the ROC curve obtained during testing? Should I calculate a new ROC curve for the final model over instances observed during training?

Is there something fundamentally wrong in my approach?

BaldML

1 Answer


Yes, you can train on all of the data and report the test (or cross-validation) ROC as an estimate of generalization performance, but that estimate will probably be optimistic.

It would be better to estimate the generalization error of the model trained on the full data by doing cross-validation. See this answer.
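A rough sketch of that cross-validated estimate, assuming scikit-learn, a feature matrix `X` with labels `y`, and a placeholder estimator standing in for the actual model (all of these are assumptions for illustration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, roc_auc_score

model = LogisticRegression(max_iter=1000)  # placeholder for the real model

# Out-of-fold probabilities: each instance is scored by a model that never
# saw it during training, so the resulting ROC estimates generalization.
oof_scores = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
fpr, tpr, thresholds = roc_curve(y, oof_scores)
print("cross-validated AUC:", roc_auc_score(y, oof_scores))

# Pick the operating threshold from this cross-validated ROC, then refit
# the production model on the full dataset.
model.fit(X, y)
```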

arinarmo