Computing probability distributions over bootstrap samples for two statistics

Question

I have a data set $x= c(0.9575,0.4950,0.1080,0.9359,0.6326)$ and I'm trying to compute the probability distributions for the statistics $\bar X^* - \bar X$ and $\sqrt n(\bar X^* - \bar X)/s^*$, over all bootstrap samples of size 5 (the same size as $x$).

My approach, via R, is to iterate over all $5^5$ possible bootstrap samples, compute the values of the statistics in each case*, and then count up the unique values of the statistics and add up their probabilities (which are $y/5^5$, where $y$ is the number of times a given statistic value appears in the "big" list of statistic values of length $5^5$).

See below for my work.

*Note that the second statistic has some cases involving division by zero, so I have an if statement in my code to avoid that.

Questions:

1. Have I correctly programmed what I set out to do?
2. I'd like to improve my conceptual understanding of the difference between $\bar X^* - \bar X$ and $\sqrt n(\bar X^* - \bar X)/s^*$. My suspicion is that the difference between the two statistics is a function of the (variability in the) original data set, and bootstrap sample size, and that nothing can be said in general about whether one tends to have more variance than the other. Am I wrong about this?

Code for $\bar X^* - \bar X$:

x= c(0.9575,0.4950,0.1080,0.9359,0.6326)
xb=mean(x)
val=rep(0,5^5)
ns=0
for(i in 1:5){
 for(j in 1:5){
  for(k in 1:5){
   for(l in 1:5){
    for(m in 1:5){
     xst =c(x[i],x[j],x[k],x[l],x[m])
     ns=ns+1
     val[ns] = mean(xst)-xb
    }
   }
  }
 }
}
vuniq = sort(unique(val))
probability = rep(0.0,length(vuniq))
count=0
for(j in 1:3125){
 for (i in 1:length(vuniq)){
  if(val[j] == vuniq[i]){
   probability[i]=probability[i]+1.0/3125.0
   count=count+1
  }
 }
}
# no further division by 3125: each match above already added 1/3125, so probability sums to 1
plot(vuniq,probability,type='h',main="Distribution of Bootstrap Mean\n minus Sample Mean",xlab="Statistic (Bootstrap Mean minus Sample Mean)",ylab="Probability (Mass)")
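
For comparison, here is a more compact sketch of the same enumeration in base R. It assumes that rounding to 10 decimal places is an acceptable way to merge values that differ only by floating-point error (an exact `==` comparison can treat reorderings of the same resample as different values); the names `idx`, `samples`, `stat`, and `pmf` are illustrative only.

```r
x <- c(0.9575, 0.4950, 0.1080, 0.9359, 0.6326)
n <- length(x)

# All n^n index tuples, one bootstrap sample per row
idx <- as.matrix(expand.grid(rep(list(seq_len(n)), n)))
samples <- matrix(x[idx], nrow = nrow(idx))

# Statistic for every sample; rounding merges values that differ only by
# floating-point error (the 10-digit tolerance is an assumption)
stat <- round(rowMeans(samples) - mean(x), 10)

# Relative frequencies over all equally likely samples are the probabilities
pmf <- table(stat) / nrow(samples)
```

Calling `plot(as.numeric(names(pmf)), as.numeric(pmf), type = "h")` should then draw the same kind of mass plot as the loop version.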

Graph for $\bar X^* - \bar X$:

Code for $\sqrt n(\bar X^* - \bar X)/s^*$:

x= c(0.9575,0.4950,0.1080,0.9359,0.6326)
xb=mean(x)
sqrt5 = sqrt(5)
val=rep(0,5^5)
ns=0
for(i in 1:5){
 for(j in 1:5){
  for(k in 1:5){
   for(l in 1:5){
    for(m in 1:5){
     xst =c(x[i],x[j],x[k],x[l],x[m])
     ns=ns+1
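     if (sd(xst) == 0) {
      # all five draws equal: s* = 0, statistic undefined, so record NA
      # (skipping with next alone would leave val[ns] at its initial 0)
      val[ns] = NA
      next
     }
     val[ns] = sqrt5*(mean(xst)-xb)/sd(xst)
    }
   }
  }
 }
}
val = val[!is.na(val)]  # drop the 5 undefined cases
vuniq = sort(unique(val))
probability = rep(0.0,length(vuniq))
count=0
for(j in 1:length(val)){
 for (i in 1:length(vuniq)){
  if(val[j] == vuniq[i]){
   # each bootstrap sample has probability 1/3125; the total is 3120/3125
   probability[i]=probability[i]+1.0/3125.0
   count=count+1
  }
 }
}
# no further division by 3125: the increments above are already probabilities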
plot(vuniq,probability,type='h',main="Distribution of Difference of Means,\n Scaled by Square Root\n of Bootstrap Variance over Sample Size",xlab="Statistic (Bootstrap Mean minus Sample Mean), Scaled",ylab="Probability (Mass)")

Graph for $\sqrt n(\bar X^* - \bar X)/s^*$:

What is the goal? How are you accounting for the fact that the bootstrap distribution may disagree with the sampling distribution? — Frank Harrell, Sep 28 '21 at 12:03

Pitouille · Answer 1 · 2021-09-28T11:45:32.870

Your loops do enumerate all $5^5$ possible resamples. In practice, however, the bootstrap is usually run by drawing resamples at random with replacement, which imitates how samples are drawn from the population and still works when full enumeration is infeasible. R has functions that make this easy. For instance, the function sample:

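x <- c(0.9575, 0.4950, 0.1080, 0.9359, 0.6326)
# Preallocate; starting from X <- x would mix the raw data into the histogram
X <- numeric(10000)
for(i in 1:10000) {
  # mean of one resample of size length(x), drawn with replacement
  X[i] <- mean(sample(x, replace=TRUE))
}
hist(X)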

or the boot package:

x <- c(0.9575, 0.4950, 0.1080, 0.9359, 0.6326)
library(boot)
myFunc <- function(data, i){
  return(mean(data[i]))
}
bootMean <- boot(x , statistic=myFunc, R=10000)
hist(bootMean$t)
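
The two snippets above resample only the mean. A minimal sketch for both statistics from the question, in base R, might look like the following. The helper `one_draw`, the seed, the number of resamples `B`, and the `1e-12` tolerance used to detect $s^* = 0$ are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- c(0.9575, 0.4950, 0.1080, 0.9359, 0.6326)
n <- length(x)
B <- 10000   # number of random resamples (assumed, matches R above)

# One resample -> both statistics; NA when s* is (numerically) zero,
# i.e. when all n draws are the same observation
one_draw <- function() {
  xs <- sample(x, replace = TRUE)
  d <- mean(xs) - mean(x)
  s <- sd(xs)
  c(diff = d, studentized = if (s > 1e-12) sqrt(n) * d / s else NA_real_)
}

# B x 2 matrix with columns "diff" and "studentized"
stats <- t(replicate(B, one_draw()))
```

`hist(stats[, "diff"])` and `hist(stats[, "studentized"])` then give Monte Carlo approximations of the two distributions; unlike full enumeration, they vary from run to run.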

The notation you are using is not "standard", but I understand that you want to better grasp the t-test. In its formula $\frac{\bar X-\mu_0}{S/\sqrt{n}}$, it is important to understand that $\bar X$ is a random variable whose distribution is the sampling distribution of the sample mean, while $\mu_0$ is a fixed value set by the null hypothesis. The denominator $S/\sqrt{n}$ is the standard error of the mean, which estimates the variability of sample means across that sampling distribution. To build intuition, it helps to leave the numerator and the denominator as they are rather than rearranging them.
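
One concrete way to see what the denominator does: dividing by $s^*$ makes the statistic unit-free, so rescaling the data changes the spread of $\bar X^* - \bar X$ but leaves $\sqrt n(\bar X^* - \bar X)/s^*$ unchanged. The sketch below checks this by full enumeration. The helper `boot_stats`, the factor of 10, and the `1e-12` tolerance for $s^* = 0$ are assumptions chosen for illustration.

```r
# Hypothetical helper: both statistics over all n^n bootstrap samples
boot_stats <- function(x) {
  n <- length(x)
  idx <- as.matrix(expand.grid(rep(list(seq_len(n)), n)))
  samples <- matrix(x[idx], nrow = nrow(idx))
  d <- rowMeans(samples) - mean(x)
  s <- apply(samples, 1, sd)
  ok <- s > 1e-12  # drop samples where s* is (numerically) zero
  list(diff = d, studentized = sqrt(n) * d[ok] / s[ok])
}

x <- c(0.9575, 0.4950, 0.1080, 0.9359, 0.6326)
a <- boot_stats(x)
b <- boot_stats(10 * x)  # same data in units ten times smaller

# Variance of the plain difference grows with the square of the scale
ratio <- var(b$diff) / var(a$diff)

# The studentized values agree up to floating-point error
same <- isTRUE(all.equal(sort(a$studentized), sort(b$studentized)))
```

Because the two statistics live on different scales, comparing their variances directly says more about the units of the data than about the statistics themselves.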
