
I have two independent samples of observations. From each sample I compute a statistic; denote these $\theta_1$ and $\theta_2$. I'd like to test the hypothesis $H_0: \Theta_1=\Theta_2$, where $\Theta_i$ is the population quantity estimated by $\theta_i$, but I have these two constraints:

  • There is no analytical estimate of the distributions of $\theta_1$ and $\theta_2$ (each statistic is the product of a computationally expensive algorithm that operates on the whole sample).
  • Even under $H_0$, exchanging observations between the two samples is not sensible. A permutation approach might therefore reject $H_0$ erroneously.

My current idea is to bootstrap $\theta_1$ and $\theta_2$ independently and then estimate the distribution of $\theta_2-\theta_1$ from the two bootstrap distributions by means of convolution.
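A minimal sketch of the idea (with a toy statistic and toy data standing in for the expensive algorithm; the helper name `bootstrap_stat` is illustrative, not from any library): with a finite set of bootstrap replicates, the convolution of the two independent bootstrap distributions amounts to forming all pairwise differences between replicates.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_stat(sample, stat, n_boot, rng):
    """Bootstrap replicates of a (possibly expensive) statistic."""
    n = len(sample)
    return np.array([stat(sample[rng.integers(0, n, n)])
                     for _ in range(n_boot)])

# Toy data and statistic, stand-ins for the real samples/algorithm.
x1 = rng.normal(0.0, 1.0, 50)
x2 = rng.normal(0.2, 1.0, 60)
stat = np.median

theta1_boot = bootstrap_stat(x1, stat, 200, rng)
theta2_boot = bootstrap_stat(x2, stat, 200, rng)

# Convolution of the two independent bootstrap distributions:
# with discrete replicates this is just all pairwise differences.
diff = (theta2_boot[:, None] - theta1_boot[None, :]).ravel()

# A simple two-sided check of H0 via the percentile interval:
# reject if the 95% interval for the difference excludes zero.
ci = np.percentile(diff, [2.5, 97.5])
reject = not (ci[0] <= 0.0 <= ci[1])
```

Note that only the first-stage bootstrap calls the expensive algorithm; the pairwise-difference step is cheap array arithmetic.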

Q1: Is this a valid approach?

Q2: Is there any reason not to extend this to jackknifing (instead of bootstrapping)?

Q3: Any references for such a 'two-sample' bootstrap?

Q4: Any recommended alternatives?
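Regarding Q2, one caution (a sketch with a toy statistic in place of the expensive one; the helper name `jackknife_stat` is illustrative): jackknife replicates are not draws from the sampling distribution of the statistic — they are far less variable, and the usual $(n-1)/n$ inflation factor must be applied before their spread can stand in for sampling variability. Convolving raw jackknife replicates would therefore badly understate the spread of $\theta_2-\theta_1$.

```python
import numpy as np

def jackknife_stat(sample, stat):
    """Leave-one-out replicates of a statistic."""
    n = len(sample)
    idx = np.arange(n)
    return np.array([stat(sample[idx != i]) for i in range(n)])

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 40)
reps = jackknife_stat(x, np.mean)

# The jackknife variance estimate needs the (n - 1) inflation factor;
# the raw spread of the replicates is much too small.
n = len(x)
var_jack = (n - 1) / n * np.sum((reps - reps.mean()) ** 2)
```

For the sample mean this reproduces the textbook result $s^2/n$ exactly, which is what makes the rescaling easy to verify.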

Trisoloriansunscreen
  • I'm not sure why you need convolution. You could just bootstrap each simultaneously (if the sample sizes are big enough to support it) and then compute a direct bootstrap distribution of the difference. – Glen_b Mar 04 '14 at 22:10
  • @Glen_b: By bootstrapping "simultaneously" do you mean to take the differences $\theta_1 - \theta_2$ on each bootstrap iteration? Which is a bit like a paired test, even though the samples are not paired? – amoeba Mar 05 '14 at 00:13
  • 2
    I mean take the differences $\hat\theta_1-\hat\theta_2$. Within a bootstrap iterate, the computation is actually like an unpaired test, not a paired test ... indeed it's exactly analogous to the numerator of a two-sample t-statistic (where in that case, $\hat \theta_i$ is a sample mean). ... though if you mean 'paired' in the sense that we're pairing by bootstrap iterate, then yes, that's exactly what I mean, but it's not really pairing in the usual sense, since the two are quite independent. It's not necessary to do it this way, but it's certainly convenient. – Glen_b Mar 05 '14 at 00:23
  • @Glen_b - 1. In my case, each bootstrap iteration is computationally expensive. Therefore, convolution will be a more computationally efficient use of the data, won't it? 2. I'm considering using this with jackknifing instead of bootstrapping, and there it's not really evident how to jackknife simultaneously (inflation factor etc). – Trisoloriansunscreen Mar 05 '14 at 06:28
  • 1. *in my case, each bootstrap iteration is computationally expensive.* -- Kind of important information to put in a question, in any case, so that people answering might offer computationally efficient alternatives. $\quad$ *Therefore, convolution will be more computationally efficient use of the data, won't it?* Hmm, I pondered this before when I responded. I don't know that it's actually more efficient. How does the reasoning go? 2. Important information to put in a question, then. – Glen_b Mar 05 '14 at 07:03
  • 1
    @Glen_b: 1. thanks, I added this info to the question. 2. Since the two samples are independent, one can resample pairs of $\theta_1$ and $\theta_2$ from the two bootstrapped distributions. This way we can have much more resampled pairs of $\theta_1-\theta_2$ than individual bootstrap iterations. I suspect that if the number of this second stage resamplings is really large, the result is the same as a convolution of the two distributions. – Trisoloriansunscreen Mar 05 '14 at 08:36
  • 1
    resampling the resamples contains no information not in the original sample -- it would surely be advantageous if it's faster to resample the resamples than to resample the original, but otherwise I'm not sure I see the additional information comes from. (You might be able to get some advantage with a smoothed bootstrap perhaps.) – Glen_b Mar 05 '14 at 08:43
  • @Trisoloriansunscreen this is exactly the problem I'm facing. did you figure this out eventually? – nivniv Feb 10 '22 at 20:15
  • @nivniv - I eventually followed the approach I was considering here. Given the sampling distributions of $\theta_1$ and $\theta_2$ and the assumption that these variables are independent, you can estimate the distribution of the difference either analytically (https://stats.stackexchange.com/a/83169) or by resampling the resamples (this makes sense if generating samples is expensive). – Trisoloriansunscreen Feb 11 '22 at 23:37
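The equivalence suggested in the comments above — that second-stage resampling of pairs converges to the convolution of the two bootstrap distributions — can be checked numerically. A sketch, with synthetic replicates standing in for the expensive first-stage bootstrap:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for bootstrap replicates of theta1 and theta2 (in practice
# these would come from the expensive first-stage bootstrap).
theta1_boot = rng.normal(0.0, 0.15, 300)
theta2_boot = rng.normal(0.25, 0.12, 300)

# Exact discrete convolution: all pairwise differences of replicates.
full = (theta2_boot[:, None] - theta1_boot[None, :]).ravel()

# Cheap second-stage resampling: draw independent pairs from the two
# replicate sets; no new runs of the expensive algorithm are needed.
n2 = 100_000
pairs = rng.choice(theta2_boot, n2) - rng.choice(theta1_boot, n2)

# The two estimates of the difference distribution should agree closely.
q_full = np.percentile(full, [5, 50, 95])
q_pairs = np.percentile(pairs, [5, 50, 95])
```

Since the resampled pairs are drawn from exactly the discrete convolution that the full pairwise-difference set enumerates, the quantiles converge as the number of second-stage draws grows.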
