Identifying data generating process by testing

Question

Assume one has two data samples: the first, $X = x_{1}, \dots, x_{n}$, and the second, $Y = y_{1}, \dots, y_{m}$, with $m \ll n$. I want to check whether $Y$ was generated by the same data generating process (DGP) as $X$. One direct approach is to apply the KS test to the two data sets: if the p-value is very small, the hypothesis of a common DGP is rejected.

Another approach that comes to mind is the following:

From the first set $X$, we bootstrap several samples of size $m$, then test each of them against the sample $Y$.

What would be a correct procedure for doing this? I do not think that simply averaging the p-values would make much sense...

Edit: I am aware that, if the null hypothesis is true, the p-value is a uniformly distributed random variable and, therefore, for a significance level $\alpha$, $\alpha \cdot 100\%$ of the p-values will be less than $\alpha$ and $(1-\alpha) \cdot 100\%$ will be greater.

(1) The KS test does not apply. You need its Lilliefors variant for two samples. (2) Neither approach would be appropriate if many ties occur in either dataset. (3) When $m$ is really small, the $p$ value of tests based on the empirical distributions are unlikely to be uniformly distributed: they will be nearly discrete. (4) Usually, to succeed with tests like this, you will want to be as specific as possible about what characteristic(s) of the DGP might have changed. But you can adapt general methods, such as [`ecp`](https://cran.r-project.org/web/packages/ecp/index.html), to this problem. — whuber, Jul 27 '20 at 14:32
Dear @whuber, Assume that $m$ is not "very small". Then, if the null hypothesis is wrong, wouldn't it be the case that all p-values of the tests will be rather very small? — ABK, Jul 27 '20 at 14:54
It depends on *how* the null hypothesis is wrong. The KS null includes the assumptions that (1) the reference distribution is specified with certainty (not estimated from data); (2) the reference distribution is continuous; and (3) the data are an iid sample from the reference distribution. If *only* (3) is violated, the p-value distribution will be what you expect. As an example where the p-values will *not* be "rather very small," run this little `R` experiment: `hist(replicate(1e3, ks.test(rpois(5, 1), pnorm)$p.value), breaks=30)`. It compares a Poisson(1) variable to the standard Normal. — whuber, Jul 27 '20 at 17:25
As an example of a non-continuous distribution of p-values when the two distributions are the same (and are continuous!), look at this simulation: `hist(replicate(1e3, ks.test(rnorm(5), rnorm(5))$p.value), breaks=30)`. Both samples (of size 5) are from a standard Normal distribution, but the p-value distribution is incredibly discrete. — whuber, Jul 27 '20 at 17:28

score 1 · Accepted Answer · answered Jul 27 '20 at 17:12

Your first approach will be fine; see for example here for more background. The two-sample KS test works for samples of unequal size.
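
To make the direct approach concrete, here is a minimal sketch of a two-sample KS comparison with $m \ll n$. It is an illustration under assumptions, not a reference implementation: the function names `ks_statistic` and `ks_permutation_test` are hypothetical, the data are simulated standard Normal draws, and the p-value comes from a permutation scheme that I have swapped in for the usual asymptotic formula (it needs only the standard library and stays valid when ties occur).

```python
import random
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample KS statistic: max |F_x(t) - F_y(t)| over all observed points t."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # Both empirical CDFs are step functions, so the maximum gap
    # is attained at one of the observed data points.
    return max(
        abs(bisect_right(xs, t) / n - bisect_right(ys, t) / m)
        for t in xs + ys
    )


def ks_permutation_test(x, y, n_perm=1000, seed=0):
    """Permutation p-value for the KS statistic (hypothetical helper)."""
    rng = random.Random(seed)
    observed = ks_statistic(x, y)
    pooled = list(x) + list(y)
    n = len(x)
    hits = 0
    for _ in range(n_perm):
        # Under H0 the labels "X" and "Y" are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n], pooled[n:]) >= observed - 1e-12:
            hits += 1
    # The add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Simulated data (an assumption for illustration): n = 300, m = 30.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(300)]
y_same = [rng.gauss(0, 1) for _ in range(30)]   # same DGP as x
y_shift = [rng.gauss(2, 1) for _ in range(30)]  # mean shifted by 2

d_same, p_same = ks_permutation_test(x, y_same)
d_shift, p_shift = ks_permutation_test(x, y_shift)
print("same DGP:    D = %.3f, p = %.4f" % (d_same, p_same))
print("shifted DGP: D = %.3f, p = %.4f" % (d_shift, p_shift))
```

The whole of $X$ is used in a single test, so no subsampling to size $m$ is involved. The smallest attainable permutation p-value is $1/(n_{\text{perm}}+1)$, so `n_perm` bounds how small a reported p-value can be.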

(And yes, you ought never to average p-values, at least not by taking an arithmetic mean. If you do decide to run multiple draws, consider Fisher's method. But you don't need the sampling approach.)
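
For reference, Fisher's method combines $k$ p-values through $-2\sum_i \ln p_i$, which follows a chi-square distribution with $2k$ degrees of freedom under the null hypothesis if the p-values are independent. The sketch below is a minimal standard-library version; the name `fisher_combine` is hypothetical, and it relies on the closed form of the chi-square tail for even degrees of freedom. Note that p-values from repeated subsamples of $X$ tested against the same $Y$ are not independent, so the chi-square reference would be only approximate at best in that setting.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (hypothetical helper).

    Returns the statistic -2 * sum(ln p_i) and its upper-tail probability
    under a chi-square distribution with 2k degrees of freedom.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    stat = -2.0 * sum(math.log(p) for p in p_values)

    # For even degrees of freedom 2k the chi-square survival function is
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total


# Four moderately small p-values: the arithmetic mean stays at 0.04,
# while the combined p-value reflects the repeated evidence.
pvals = [0.04, 0.04, 0.04, 0.04]
stat, combined = fisher_combine(pvals)
print("arithmetic mean:", sum(pvals) / len(pvals))
print("Fisher statistic:", stat, "combined p-value:", combined)
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check of the tail formula.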
