
In machine learning (ML) tasks, one splits the dataset into training and test sets. We train the ML model on the training set, and then we evaluate the model's performance on the test set.

It is always crucial to have an "independent" test set, i.e., samples that the model has not seen during the training phase.

In research, it is important to demonstrate that the test set is independent. This raises the following question:

How do we guarantee the "independence" of our test set?
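For context, a minimal sketch of the kind of split described above, assuming scikit-learn's train_test_split; the feature matrix X, label vector y, and the seed value are illustrative placeholders, not from the original post. The fixed random_state is what makes the split reproducible:

```python
# A minimal sketch, assuming scikit-learn and NumPy are available.
# X and y below are hypothetical placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # documented seed: the split can be reproduced exactly

X = np.random.default_rng(SEED).normal(size=(1000, 10))  # placeholder features
y = (X[:, 0] > 0).astype(int)                            # placeholder labels

# Random selection of the test set: each sample is assigned to train or test
# without regard to its features or labels, which is what "independence by
# construction" refers to.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)
```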

Hamed
  • You select it randomly -- this guarantees independence *by construction.* That's the entire point of random selection! – whuber Feb 11 '22 at 17:02
  • Building on that, cross-validation is popular because it helps guard against getting unlucky with your test set. Maybe you'll get unlucky once, but that gets washed out by performance on the other folds. – Dave Feb 11 '22 at 17:08
  • @whuber thank you. But my question is more about *how* to prove to the reviewers of a journal, for example, that you have randomly split the dataset. Is there a method that can show the independence of the training and test sets (as a number, something similar to a correlation)? – Hamed Feb 11 '22 at 17:20
  • @Dave thank you for your comment. I understand your point, but how do we connect the cross-validation results to the independence of the test set? In other words, how do we respond if someone questions the validity of our work? – Hamed Feb 11 '22 at 17:28
  • You prove you have randomly split a dataset by documenting the procedure you used. If, for instance, you used a computer's pseudorandom number generator, then you explain that. Those PRNGs have been tested, and therein lies the guarantee you seek. – whuber Feb 11 '22 at 17:34
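Following whuber's suggestion to document the procedure, one possible sketch (not from the thread) of making the split auditable is to record the PRNG seed and persist the exact test indices, so a reviewer can regenerate or verify the same split. The seed value, file name, and dataset size below are illustrative assumptions:

```python
# A hedged sketch using NumPy's tested PRNG; the seed, file name, and sizes
# are illustrative assumptions, not part of the original question.
import numpy as np

SEED = 20220211          # documented seed; report this alongside the results
N_SAMPLES = 1000         # hypothetical dataset size
TEST_FRACTION = 0.2

rng = np.random.default_rng(SEED)          # pseudorandom number generator (PCG64)
permutation = rng.permutation(N_SAMPLES)   # random ordering of all sample indices

n_test = int(N_SAMPLES * TEST_FRACTION)
test_idx = np.sort(permutation[:n_test])   # first 20% of the shuffled indices
train_idx = np.sort(permutation[n_test:])  # remaining 80%

# Persist the exact test indices so the reported split can be verified.
np.savetxt("test_indices.txt", test_idx, fmt="%d")
```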

0 Answers