Suppose we have two data sets $X_1,\ldots,X_m$ and $Y_1,\ldots,Y_n$, each i.i.d., and we want to determine whether $\mathbb{E}[X_1] = \mathbb{E}[Y_1]$ using the statistic $\hat{\Delta}_{m,n} = \bar{X}_m - \bar{Y}_n$. A resampling (permutation) procedure would treat the observed values as if they all came from a single pooled sample, repeatedly reassign them at random to an $X$ group of size $m$ and a $Y$ group of size $n$, and compute $\hat{\Delta}_{m,n}$ for each reassignment. The observed difference would then be compared to these Monte Carlo differences to compute a $p$-value.
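To make the procedure concrete, here is a minimal sketch of what I have in mind (the function name, the number of permutations $B$, and the add-one correction are my own choices, not taken from any particular reference):

```python
import numpy as np

def permutation_test_mean_diff(x, y, B=10_000, rng=None):
    """Two-sided permutation p-value for Delta = mean(x) - mean(y)."""
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x), np.asarray(y)
    m = len(x)
    pooled = np.concatenate([x, y])
    observed = x.mean() - y.mean()

    count = 0
    for _ in range(B):
        perm = rng.permutation(pooled)          # relabel the pooled sample
        delta = perm[:m].mean() - perm[m:].mean()
        count += abs(delta) >= abs(observed)    # two-sided comparison
    # add-one correction so the reported p-value is never exactly zero
    return (count + 1) / (B + 1)
```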
If one were to go even further and ask whether the two data sets have the same distribution, the same procedure could be applied with the Kolmogorov-Smirnov statistic (or any other metric comparing distributions) in place of $\hat{\Delta}_{m,n}$.
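The same resampling scheme with the KS statistic swapped in would look like this sketch (scipy's `ks_2samp` is used only to compute the statistic itself; everything else is my own framing):

```python
import numpy as np
from scipy.stats import ks_2samp

def permutation_test_ks(x, y, B=10_000, rng=None):
    """Permutation p-value for the two-sample Kolmogorov-Smirnov statistic."""
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x), np.asarray(y)
    m = len(x)
    pooled = np.concatenate([x, y])
    observed = ks_2samp(x, y).statistic

    count = 0
    for _ in range(B):
        perm = rng.permutation(pooled)
        stat = ks_2samp(perm[:m], perm[m:]).statistic
        count += stat >= observed               # large KS values indicate discrepancy
    return (count + 1) / (B + 1)
```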
How would I theoretically justify these procedures? The questions that clearly need to be answered are: (1) Does the test have appropriate behavior (e.g., the correct level) under the null hypothesis? (2) Is the test consistent under the alternative hypothesis? A nice additional question would be: (3) How can the power of the test be characterized? If I wanted to get into the mathematical details of why these procedures work, what would I see?