Many machine learning papers I read follow something like one of these two procedures:
- Split the data into a train and a test set,
- train different models on the train set and evaluate their performance on the test set,
- report the scores of the models on this test set,
or
- do a cross-validation with every model on the full dataset,
- report each model's score as the mean over its cross-validation results.
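To make this concrete, here is a minimal sketch of both procedures as I understand them (using scikit-learn on a synthetic toy dataset; the two models are only placeholders, not taken from any particular paper):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}

# Procedure 1: single train/test split; the same test set is used both to
# pick the best model and to report its score.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
test_scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
               for name, m in models.items()}
print("reported (procedure 1):", max(test_scores.items(), key=lambda kv: kv[1]))

# Procedure 2: cross-validation on the full dataset; the same CV folds are
# used both to pick the best model and to report its mean score.
cv_scores = {name: cross_val_score(m, X, y, cv=5).mean()
             for name, m in models.items()}
print("reported (procedure 2):", max(cv_scores.items(), key=lambda kv: kv[1]))
```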
To me, both of these approaches are obviously flawed: model selection and the estimate of the empirical risk are based on the same test data, so the reported score of the selected model is optimistically biased. In my opinion, reporting "model x was the best" would be a correct conclusion, but reporting "model x performs with score y" is data dredging.
Some examples:
My questions are:
- If I am right, why do respected researchers, journals, and reviewers ignore this?
- Should one contact the journals and ask for corrections?
- Why do reviewers not simply ask researchers to estimate the empirical risk on a hold-out set that was not used for model selection (as sketched below)?
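To be clear about what I mean in the last question, here is a sketch of the protocol I would expect (again scikit-learn with placeholder models, purely illustrative): model selection uses a validation set, and a separate hold-out test set is touched exactly once to report the risk of the winner.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Split into train (60%), validation (20%), and hold-out test (20%).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}

# Model selection uses only the validation set ...
val_scores = {name: m.fit(X_train, y_train).score(X_val, y_val)
              for name, m in models.items()}
best_name = max(val_scores, key=val_scores.get)

# ... and the hold-out test set is used once, to report the score of the winner.
print(best_name, "hold-out score:", models[best_name].score(X_test, y_test))
```

The analogous fix for the cross-validation variant would be nested cross-validation, where model selection happens entirely inside the inner folds and only the outer folds are used for the reported estimate.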