Testing that training and test data sets (not vectors) have been drawn from the same population in R

Question

I have two data frames in R (training + test). I did not split them from a single sample myself; I was given the two data sets and am supposed to build the model on the training set and use it to make predictions on the test set (without the target variable being in the validation set). I understand that predictions for the provided test data set are only justified if we can confirm that the training and test data sets follow the same distribution (are from one population).

In the post "Non-parametric test if two samples are drawn from the same distribution", using the Kolmogorov–Smirnov test was discussed. However, in R, ks.test seems to work only on vectors. Would I therefore have to run this test on each numeric variable independently, or is there a test/way to do it on the whole data set at once?

If you randomly selected the validation data set from the joint (training + validation) data set, and your sample sizes are fairly large, it's reasonable to assume the validation data set is drawn from the same population. If sample sizes are relatively small, especially if the number of variables is large, your validation data set may indeed be non-representative, however. — jbowman, Oct 24 '18 at 03:45
Welcome to CV! Your wording is not entirely correct, as you cannot prove that two data sets were drawn from the same population. Not with statistics at least. Another point of confusion is that you claim to have a validation and test set, but not a train set. Did you mean either of these to be the train set and the other to be used for internal validation? Please edit your post to make it more clear what it is you are trying to achieve, rather than by what method you are trying to achieve it. — Frans Rodenburg, Oct 24 '18 at 05:52
@FransRodenburg thank you for pointing out my mistake in wording, I have edited it accordingly :) — Sarah, Oct 24 '18 at 06:59
This changes the situation further, if your second data set is unlabeled, you cannot validate on it, so it is not really a validation set. How many observations are there in the labeled data set? Can you perhaps split that one into train and validation and then predict new labels on the second data set? — Frans Rodenburg, Oct 24 '18 at 07:19
Okay I see your point - I guess I meant test set then (as opposed to validation set). So in order to be able to draw conclusions from the model based on the training set to the test set (for which I would like to predict the label). — Sarah, Oct 24 '18 at 10:48

score 1 · Answer 1 · answered Oct 24 '18 at 07:21

If you compare the performance of different models on the validation set, you can already tell whether the validation set was largely unrepresentative of the train set. Namely, if you find large discrepancies between training and validation error, which you cannot resolve by dealing with overfitting (e.g. reducing the capacity of the model, regularizing the coefficients), then there is too large a difference between these two sets and you can conclude that they were perhaps not drawn from the same data-generating process.
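To illustrate the comparison described above, here is a minimal sketch in base R. The data frame `labeled`, its columns, and the linear model are all made up for illustration; with real data you would substitute your own labeled set and model.

```r
# Sketch: compare training and validation error on a labeled data set.
# All data here are simulated; `labeled`, `x` and `y` are hypothetical names.
set.seed(7)
n <- 400
labeled <- data.frame(x = runif(n))
labeled$y <- 2 * labeled$x + rnorm(n, sd = 0.3)

# Random 70/30 split into training and validation rows
idx <- sample(n, size = 280)
train <- labeled[idx, ]
valid <- labeled[-idx, ]

fit <- lm(y ~ x, data = train)

# Root mean squared error on each part
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
train_rmse <- rmse(train$y, predict(fit, newdata = train))
valid_rmse <- rmse(valid$y, predict(fit, newdata = valid))
c(train = train_rmse, validation = valid_rmse)
```

Because the split here is random, the two errors should be close. A large gap that persists after simplifying or regularizing the model would be the warning sign.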

Note that a data set that is lacking the variable of interest cannot be used as a validation set.

score 0 · Answer 2 · answered Oct 23 '18 at 21:02

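If the training and validation sets are both drawn i.i.d. from the same population, then a machine learning model should not be able to tell the two data sets apart. In other words, a classifier trained to predict which set a row came from should do no better than chance.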
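A rough sketch of this idea in base R follows. The data frames `train` and `test` are simulated (the test set is shifted on purpose), and logistic regression with a hand-rolled AUC is just one possible choice of classifier and metric, not the only one.

```r
# Sketch: can a classifier tell training rows from test rows?
# `train` and `test` are simulated stand-ins with the same predictor columns.
set.seed(1)
train <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
test  <- data.frame(x1 = rnorm(500, mean = 0.5), x2 = rnorm(500))  # shifted on purpose

# Stack both sets and label the origin (0 = training, 1 = test)
combined <- rbind(train, test)
combined$is_test <- rep(c(0, 1), times = c(nrow(train), nrow(test)))

# Hold out half of the rows so the AUC is not inflated by overfitting
holdout <- sample(nrow(combined), size = 500)
fit <- glm(is_test ~ ., data = combined[-holdout, ], family = binomial)
p <- predict(fit, newdata = combined[holdout, ], type = "response")

# AUC via the rank (Mann-Whitney) formula
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
domain_auc <- auc(p, combined$is_test[holdout])
domain_auc  # near 0.5: not separable; clearly above 0.5: the sets differ
```

An AUC close to 0.5 does not prove the two sets come from the same population. It only means this particular classifier found nothing to separate them with.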

Would I therefore have to do this test for each numeric variable independently or is there a test/way to do that on the whole data set at once?

Univariate analysis (KS test, chi-squared test, etc.) does not take combinations of attributes into account. Even if every variable follows the same distribution in the training and validation sets, you could still have different populations (or data sources), for example if the dependence between variables differs.
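A small simulated example of this pitfall, again in base R: both columns are standard normal in both sets, so a per-column ks.test has nothing to detect, yet the joint distributions are clearly different. The data frames and the correlation value of 0.9 are made up for illustration.

```r
# Sketch: identical marginals, different joint distribution (simulated data).
set.seed(42)
n <- 1000

# Training set: x1 and x2 independent standard normals
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# Test set: same N(0, 1) marginals, but x1 and x2 have correlation 0.9
z <- rnorm(n)
test <- data.frame(x1 = z, x2 = 0.9 * z + sqrt(1 - 0.9^2) * rnorm(n))

# ks.test works on vectors, so apply it column by column
ks_p <- sapply(names(train), function(v) ks.test(train[[v]], test[[v]])$p.value)
ks_p

# The dependence structure is what differs between the two sets
cor_train <- cor(train$x1, train$x2)
cor_test  <- cor(test$x1, test$x2)
c(train = cor_train, test = cor_test)
```

The sapply call is also the straightforward way to run ks.test on every numeric column of two data frames. With many columns, keep in mind that some small p-values are expected by chance alone.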
