This question is motivated by the discussion of this earlier question.
I have two samples $X$ and $Y$, each with $n$ elements. Both samples represent optimal solutions returned from two different stochastic optimisation solvers on the same optimisation problem: results from one solver are in sample $X$, and results from the other solver are in sample $Y$. The elements within each sample are independent of one another, and the two samples are independent of each other.
I am wondering what the appropriate method is to compare samples $X$ and $Y$ to detect whether one solver tends to provide better optimal solutions than the other for this particular optimisation problem.
I have two ideas:
1. Estimate the sample mean $\bar{d}$ of the element-wise differences between $X$ and $Y$, and check whether the 95% confidence interval of $\bar{d}$ includes zero.
2. Estimate the difference in sample means of $X$ and $Y$, $\bar{m} = \bar{x} - \bar{y}$, and check whether the 95% confidence interval of $\bar{m}$ includes zero.
For either idea, we would conclude that one solver is better than the other if zero does not fall within that idea's estimated confidence interval of the relevant sample statistic.
In the first idea, I think the algorithm would be as follows (a rough code sketch follows the steps):
1. Produce a single bootstrap resample with replacement of $X$, called $X^{*}$, with $n$ elements.
2. Produce a single bootstrap resample with replacement of $Y$, called $Y^{*}$, with $n$ elements.
3. Take the element-wise difference between $X^{*}$ and $Y^{*}$ to give $D^{*}$.
4. Calculate the resample mean $\bar{d}^{*}$ of $D^{*}$ and store it.
5. Repeat steps 1 to 4 a large number of times (say 10000) to give a set of $\bar{d}^{*}$ values.
6. Use the empirical distribution of the $\bar{d}^{*}$ values to estimate the 95% confidence interval of the sample mean difference $\bar{d}$.
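To make the first idea concrete, here is a rough sketch of how I picture implementing it (a minimal example using NumPy; the function name, the seed, and the percentile construction of the interval are just my own choices, not anything prescribed):

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily, for reproducibility

def bootstrap_mean_diff_ci(x, y, n_boot=10_000, alpha=0.05):
    """Idea 1: bootstrap the mean of the element-wise differences of X and Y,
    following the steps above (X* and Y* are resampled independently)."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    d_bar_star = np.empty(n_boot)
    for b in range(n_boot):
        x_star = rng.choice(x, size=n, replace=True)  # step 1: resample X
        y_star = rng.choice(y, size=n, replace=True)  # step 2: resample Y
        d_star = x_star - y_star                      # step 3: element-wise difference
        d_bar_star[b] = d_star.mean()                 # step 4: store the resample mean
    # steps 5-6: percentile interval from the empirical distribution of d_bar_star
    return np.percentile(d_bar_star, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```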
For the second idea, I think the algorithm (rewritten from this question) would be as follows (again, a rough code sketch follows the steps):
1. Produce a single bootstrap resample with replacement of $X$, called $X^{*}$, with $n$ elements.
2. Produce a single bootstrap resample with replacement of $Y$, called $Y^{*}$, with $n$ elements.
3. Calculate the resample mean $\bar{x}^{*}$ of $X^{*}$.
4. Calculate the resample mean $\bar{y}^{*}$ of $Y^{*}$.
5. Take the difference $\bar{m}^{*} = \bar{x}^{*} - \bar{y}^{*}$ and store it.
6. Repeat steps 1 to 5 a large number of times (say 10000) to give a set of $\bar{m}^{*}$ values.
7. Use the empirical distribution of the $\bar{m}^{*}$ values to estimate the 95% confidence interval of the sample statistic $\bar{m}$.
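And a matching sketch for the second idea, with the same caveats (NumPy, illustrative names, percentile interval); the commented-out usage at the end shows the decision rule from above, with `results_solver_a` and `results_solver_b` as placeholder names for the two result arrays:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily, for reproducibility

def bootstrap_diff_of_means_ci(x, y, n_boot=10_000, alpha=0.05):
    """Idea 2: bootstrap the difference in sample means of X and Y."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    m_bar_star = np.empty(n_boot)
    for b in range(n_boot):
        x_star = rng.choice(x, size=n, replace=True)   # step 1: resample X
        y_star = rng.choice(y, size=n, replace=True)   # step 2: resample Y
        m_bar_star[b] = x_star.mean() - y_star.mean()  # steps 3-5: difference of resample means
    # steps 6-7: percentile interval from the empirical distribution of m_bar_star
    return np.percentile(m_bar_star, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Decision rule from above: conclude one solver tends to do better
# if zero lies outside the interval.
# lo, hi = bootstrap_diff_of_means_ci(results_solver_a, results_solver_b)
# solvers_differ = not (lo <= 0.0 <= hi)
```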
I believe that ideas 1 and 2 are nonparametric bootstrap approaches analogous to the paired-samples t-test and the independent-samples t-test, respectively. I think the second idea is the more appropriate one, since the first idea treats the specific pairing of elements between $X$ and $Y$ as meaningful, when in fact the pairing is arbitrary: the solver runs are independent, so there is no natural correspondence between the $i$-th element of $X$ and the $i$-th element of $Y$.
Of course, my post assumes that comparing means is the most sensible way of comparing results; I am open to discussion of other point statistics or interval statistics that more knowledgeable people may be aware of.
Am I right in my thinking that the second idea is the appropriate way of comparing the optimal solutions provided by these two solvers on the same optimisation problem?