
Many machine learning papers I read follow something like this procedure:

  1. split the data into a training set and a test set,
  2. train different models on the training set and evaluate their performance on the test set,
  3. report the models' scores on that same test set,

                  or

  1. run a cross-validation with every model on the full dataset,
  2. report each model's score as the mean over its cross-validation results.

To me both of these approaches are obviously flawed, since model selection and the estimate of the empirical risk are based on the same test data. Reporting "model X was the best" would be a fair conclusion, but reporting "model X performs with score Y" is data dredging: the score of the selected model is an optimistically biased estimate of its risk.
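To make the concern concrete, here is a small simulation sketch (purely illustrative, assuming only NumPy): every candidate model has the same true RMSE of 1.0, yet quoting the score of whichever model happens to look best on the shared test data gives a systematically lower number.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test, n_models, n_repeats = 100, 10, 2000

reported = np.empty(n_repeats)
for r in range(n_repeats):
    # Each row holds the test-set errors of one candidate model; the errors are
    # standard normal, so every model's true RMSE is exactly 1.0.
    errors = rng.normal(size=(n_models, n_test))
    rmse_hat = np.sqrt((errors ** 2).mean(axis=1))
    # Flawed reporting: quote the score of whichever model looked best
    # on the very data used to pick it.
    reported[r] = rmse_hat.min()

print("true RMSE of every model:            1.0")
print("mean RMSE reported for the 'winner':", reported.mean().round(3))
# The reported average comes out noticeably below 1.0 -- an optimistic bias
# created purely by selecting and scoring on the same test data.
```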

My questions are:

  • If I am right, why do respected researchers, journals, and reviewers ignore this?
  • Should one contact the journals and ask for corrections?
  • Why do reviewers not simply ask researchers to estimate the empirical risk on a hold-out set?
  • You are right that this third step, '3. report back the best score', should be done with an additional [validation/test phase](https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set). The language might be confusing. In your question the first two steps might be considered a single step. Basically you have 'step 1: train the model, step 2: verify/test the model' (the confusing part is that the training step can itself be split into steps). ... – Sextus Empiricus Mar 16 '21 at 11:18
  • ... Because of this confusing language, your question becomes difficult to answer without exact references to the flawed practice. – Sextus Empiricus Mar 16 '21 at 11:19
  • @SextusEmpiricus Thanks, I tried to clarify it a bit. Is it clearer now? An exact reference would be the first paper, which claims "that attention-based SMILES encoders significantly surpass a baseline feedforward model utilizing Morgan (circular) fingerprints (Rogers & Hahn, 2010) (p < 1e-6 on RMSE)". They used 7 models and did a 25-fold CV for each to evaluate performance. I am claiming they have an optimistic bias in their RMSE estimation and should have used nested cross-validation. – PascalIv Mar 16 '21 at 11:48
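For reference, the nested cross-validation pattern mentioned in the last comment can be sketched as follows (a minimal illustration with scikit-learn, synthetic data, and a ridge regressor standing in for the actual models; the data, estimator, and hyperparameter grid are assumptions, not the cited paper's setup). The inner loop selects the model/hyperparameters; the outer loop estimates risk on folds never seen by the selection step.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic regression data, purely for illustration.
X, y = make_regression(n_samples=300, n_features=30, noise=10.0, random_state=0)

# Inner loop: model/hyperparameter selection.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                      scoring="neg_root_mean_squared_error", cv=inner_cv)

# Outer loop: risk estimation on data never used by the selection step.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
outer_scores = -cross_val_score(search, X, y, cv=outer_cv,
                                scoring="neg_root_mean_squared_error")

print("nested-CV RMSE estimate: %.3f +/- %.3f"
      % (outer_scores.mean(), outer_scores.std()))
```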

0 Answers