Why do we reserve a test set for final model evaluation?
Let's say you select a model using nested k-fold cross-validation: out of the many candidates you tried, you end up with one really good model along with an estimate of its generalization performance. You might then even choose to retrain this best model on all of the available training data (not reserving any for validation).
Finally, the literature has always told me to evaluate it on the test set to get some performance metric. Then what? You have already arrived at your best model via the cross-validation scheme, so what is this performance metric actually used for?
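For concreteness, here is roughly the workflow I have in mind, as a minimal sketch with scikit-learn (the dataset, estimator, and parameter grid are arbitrary placeholders, not part of my actual setup):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Inner loop: hyperparameter tuning via cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
inner_search = GridSearchCV(SVC(), param_grid, cv=5)

# Outer loop: estimate the generalization performance of the whole
# selection procedure (this is the "nested" part).
outer_scores = cross_val_score(inner_search, X_train, y_train, cv=5)
print("Nested CV estimate:", outer_scores.mean())

# Refit the selected model on all of the training data ...
inner_search.fit(X_train, y_train)

# ... and finally evaluate it once on the untouched test set.
print("Test set score:", inner_search.score(X_test, y_test))
```

My question is about what that last line buys me, given that the nested CV estimate already exists.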
In a machine learning competition like Kaggle, it is useful to rank competitors by their performance on a fixed test set and pick a winner. But if you are developing a model for some practical application within a company, what does test set performance give you when you already have your best model, and an estimate of its generalization performance, from cross-validation?
Looking at test set performance and repeatedly tuning the model in response risks overfitting the model to the test set. So unless you engage in this risky back-and-forth tuning dance (which you obviously shouldn't, since it optimistically biases your performance estimate), what is the point of a test set at all?