
BACKGROUND: Before performing a t-test to compare the means of two samples, it is common to check whether the variances can be assumed to be equal. Usually this is done with either an $F$ test or Levene's test. However, these tests (apparently) depend strongly on the normality of the underlying data. Bootstrapping was suggested as a better alternative in a Coursera course.

QUESTION: I am considering a simple dataset put together by just subsetting the miles per gallon in the mtcars dataset and comparing cars with four and eight cylinders:

(four_cyl    <- mtcars$mpg[mtcars$cyl==4])
[1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4
(eight_cyl   <- mtcars$mpg[mtcars$cyl==8])
[1] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2 15.8 15.0

If I run an $F$ test, there doesn't seem to be any reason to conclude that the variances are different:

var.test(four_cyl, eight_cyl)

    F test to compare two variances

data:  four_cyl and eight_cyl
F = 3.1033, num df = 10, denom df = 13, p-value = 0.05925
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
  0.9549589 11.1197133
sample estimates:
ratio of variances 
          3.103299 

I also ran Levene's test, but I had problems with the sample sizes being unequal:

library(reshape2)
library(car)
sample  <- as.data.frame(cbind(four_cyl, eight_cyl))
dataset <- melt(sample)
leveneTest(value ~ variable, dataset)

which yielded a $p$-value of 0.1061.
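I suspect the recycling is the culprit: with unequal lengths, cbind pads the shorter vector, so the test above runs on partly duplicated data. A sketch of one way to avoid that, stacking the two vectors into long format directly (only the car package is needed), would be something like:

library(car)
# stack the two vectors and label each value with its group
dataset <- data.frame(
  value    = c(four_cyl, eight_cyl),
  variable = factor(rep(c("four_cyl", "eight_cyl"),
                        c(length(four_cyl), length(eight_cyl))))
)
leveneTest(value ~ variable, data = dataset)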

Now, when I put together a bootstrapping attempt, the plot doesn't seem to be so reassuring:

# bootstrap the SD of the four-cylinder group and plot its distribution
dat <- replicate(1e4, sample(four_cyl, length(four_cyl), replace = T))
sd_4 <- apply(dat, 2, sd)
hist(sd_4, freq = F, col="moccasin", ylim=c(0,1), border = F)

# bootstrap the SD of the eight-cylinder group and overlay it
dat <- replicate(1e4, sample(eight_cyl, length(eight_cyl), replace = T))
sd_8 <- apply(dat, 2, sd)
hist(sd_8, add=T, freq = F, col="navajowhite4", border = F)

[Overlaid histograms of the bootstrapped standard deviations for the two groups]

However, I don't know how to test the difference shown in the bootstrap simulation numerically. The temptation is to run a t-test, t.test(sd_4, sd_8), which yields a p-value < 2.2e-16, but this doesn't make much sense...

Perhaps it's possible to just take the ratios of the standard deviations across the $10,000$ bootstrap samples and calculate a Wald interval on them:

boot_ratios <- sd_4/sd_8
mean(boot_ratios) + c(-1, 1) * qnorm(0.975) * sd(boot_ratios)
[1] 0.7533392 2.9471311


This would include $1$ and hence be concordant with the failure to reject $H_0$ with the $F$ test.

ADDENDUM: Following @Scortchi's comment, I calculated the quantiles as:

quantile(boot_ratios, c(0.025, 0.975))
    2.5%    97.5% 
1.075067 3.243562 

and they don't include $1$.
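As a further check (just a sketch using the boot package, with stratified resampling so that each group is resampled separately, mirroring the hand-rolled bootstrap above), a BCa interval may cope better with the skewness of the ratio than the Wald interval does:

library(boot)
# long-format data with a group indicator
mpg_dat <- data.frame(mpg = c(four_cyl, eight_cyl),
                      cyl = factor(rep(c(4, 8), c(length(four_cyl), length(eight_cyl)))))
# statistic: ratio of the group standard deviations in a resample
sd_ratio <- function(d, i) {
  d2 <- d[i, ]
  sd(d2$mpg[d2$cyl == 4]) / sd(d2$mpg[d2$cyl == 8])
}
b <- boot(mpg_dat, sd_ratio, R = 1e4, strata = mpg_dat$cyl)
boot.ci(b, type = c("perc", "bca"))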

Is this procedure sound overall? How can I interpret the results?

Antoni Parellada
  • I'd imagine they mean using the bootstrap to estimate the distribution of the *test statistic* under the null. Something along the lines of resampling from the combined distribution of centred observations. (BTW I don't think the F test for equal variances is usually called the "Fisher test", & Levene's test is supposed to be quite robust.) – Scortchi - Reinstate Monica Oct 22 '15 at 13:21
  • @Scortchi: I found a sentence that may contain the key on a slide: "Alternatively, for moderate sample sizes, consider creating a bootstrap confidence interval for the ratio of the two variables." Does this help? – Antoni Parellada Oct 22 '15 at 13:53
  • That's an alternative. You could also directly find the 0.025 & 0.975 percentiles of the bootstrap ratios. – Scortchi - Reinstate Monica Oct 22 '15 at 16:29
  • The approach of "*test for equality of variance then if you don't reject, use a t-test that assumes equality of variance otherwise use one that doesn't assume equality of variance*" is in general not as good as the much simpler approach "*if you're not in a position to assume the variances are equal, just don't assume the variances are equal*" (i.e. if you were going to use say a Welch test if you rejected the equality of variance test, *just use the Welch test without testing variance first*). There's a number of posts on site that advise against such "preliminary" testing, with references. – Glen_b Oct 22 '15 at 22:22
  • (ctd) ... For example, there's a reference [here](http://stats.stackexchange.com/questions/97098/practically-speaking-how-do-people-handle-anova-when-the-data-doesnt-quite-mee/97120#97120) and a link to a different answer with other references. One nice thing about a number of stats packages these days (notably R but also a number of other packages) is that they take the usually sensible default of not assuming equality. – Glen_b Oct 22 '15 at 22:27
  • @Glen_b Thank you very much, Glen. I'll read the link posted. I presume that you didn't see any gross misunderstandings in the bootstrapping experiment itself. – Antoni Parellada Oct 22 '15 at 22:30
  • I didn't look closely (I expect your implementation is okay). However, as a general principle, I would tend to have concerns about the properties of the bootstrap at such small sample sizes (I'd want to see that it would give me the sort of properties I seek; so if one is using CIs as the basis of a test, do the CIs at such sample sizes have close to the desired coverage level under some plausible assumptions? I'm not especially fussy about significance levels being off a bit, but if I get effective rejection rates under the null that are quite far from what I think I would get, it's a worry) – Glen_b Oct 22 '15 at 22:38

1 Answer


A test proper could be devised by assuming that the two groups differ only in location $\mu$ & scale $\sigma$, so that their distribution functions are $F_1(x_1) = F\big((x_1-\mu_1)/\sigma_1\big)$ & $F_2(x_2) = F\big((x_2-\mu_2)/\sigma_2\big)$; then under the null hypothesis of a common scale $\sigma=\sigma_1=\sigma_2$, the centred variable $Y = X_i-\mu_i$ has the distribution $F_\mathrm{c}(y) = F(y/\sigma)$, which can be approximated by the empirical distribution function $$\hat F_\mathrm{c}(y) = \frac{\sum_{i=1}^{2}\sum_{j=1}^{n_i} I(x_{ij} - \bar x_i \leq y)}{n_1+n_2},$$ where $I$ is the indicator function.

So the bootstrap procedure would resample from the pooled differences between each observation & the mean of its group, & compare the distribution of the test statistic (say the usual $F$ statistic, whose distribution shouldn't depend overmuch on the unknown characteristics of $F_\mathrm{c}(\cdot)$) with the value in fact observed.

(This is a rather off-the-cuff answer—it illustrates a basic bootstrap test, but I daresay there are better methods to use in practice.)
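For concreteness, applied to the mpg data in the question (taking four_cyl & eight_cyl as defined there), the recipe would look roughly like this, with the alternative taken one-sided in the direction of the larger observed variance:

# observed test statistic
F.obs <- var(four_cyl)/var(eight_cyl)
# bootstrap population: pooled centred observations under the null of equal scale
boot.pop <- c(four_cyl - mean(four_cyl), eight_cyl - mean(eight_cyl))
# bootstrap distribution of the test statistic
F.boot <- replicate(1e4, {
  x1.boot <- sample(boot.pop, length(four_cyl), replace=T)
  x2.boot <- sample(boot.pop, length(eight_cyl), replace=T)
  var(x1.boot)/var(x2.boot)
})
# one-sided bootstrap p-value
mean(F.boot >= F.obs)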

Here's some code to test it:

# sample sizes
n1 <- 12
n2 <- 20
# location parameters
mu1 <- 3
mu2 <- 6
# scale parameters (alt. hyp sigma1 > sigma2)
sigma1 <- 1
sigma2 <- 1
# random draws from the underlying distribution F, e.g. Student's t with 3 degrees of freedom
rdist <- function(n) rt(n, df=3)

# no. simulations to perform
no.sims <- 1000
# no. bootstrap samples to take in each simulation
no.boots <- 1000

# initialize vector of p-values - for normal F test
p.normFtest <- numeric(no.sims)
# initialize vector of p-values - for bootstrap test
p.bsFtest <- numeric(no.sims)

# simulate!
for (j in 1:no.sims){
  # simulate samples
  rdist(n1)*sigma1 + mu1 -> x1
  rdist(n2)*sigma2 + mu2 -> x2
  # calculate observed test statistic
  var(x1)/var(x2) -> F.obs
  # calculate its p-value
  1-pf(F.obs,n1-1,n2-1) -> p.normFtest[j]
  # initialize vector of test statistics
  F.boot <- numeric(no.boots)
  # define bootstrap population
  c(x1-mean(x1),x2-mean(x2)) -> boot.pop
  # bootstrap!
  for (i in 1:no.boots){
    # 1st sample
    x1.boot <- sample(boot.pop, n1, replace=T)
    # 2nd sample
    x2.boot <- sample(boot.pop, n2, replace=T)
    # calculate bootstrap test statistic
    var(x1.boot)/var(x2.boot) -> F.boot[i]
  }
# estimate bootstrap p-value
sum(F.boot >= F.obs)/no.boots -> p.bsFtest[j]
}

# examine distributions of p-values (should be uniform under null)
plot(ecdf(p.normFtest), col="black", do.points=F, verticals=T, xlab="calculated p-value", ylab="simulated distribution function", main="")
plot(ecdf(p.bsFtest), col="red", add=T, do.points=F, verticals=T)
abline(a=0,b=1, col="grey", lty="dashed")
legend("bottomright", legend=c("normal", "bootstrapped"), lty=1, col=c("black","red")) 

[Figure: empirical distribution functions of the simulated p-values for the normal and bootstrap F tests]

So in this particular case this bootstrap F test maintains size rather better than the F test based on an incorrect assumption of normality. It'd be interesting to compare it with more robust tests like Levene's†, & to examine its power vs the normal F test when the normality assumption is correct. And using a double bootstrap might help to keep the test's size closer to the nominal level.
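As a rough sketch, Levene's test could be slotted into the same simulation by collecting a third vector of p-values inside the j loop (this assumes the car package, whose leveneTest uses the Brown–Forsythe centring by default):

library(car)
# before the simulation loop: initialize a vector of p-values for Levene's test
p.levene <- numeric(no.sims)
# group labels for the stacked samples
g <- factor(rep(1:2, c(n1, n2)))
# inside the j loop, after x1 & x2 have been simulated:
leveneTest(c(x1, x2) ~ g)[["Pr(>F)"]][1] -> p.levene[j]

Its empirical distribution function could then be added to the plot above in the same way.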

† With the commonly used Brown–Forsythe modification.

Scortchi - Reinstate Monica