
I have multiple paired $t$-tests, such as one giving results:

$t_{14} = 2.7,\ p = .017$

Although people seem to do effect sizes in different ways in repeated samples, I have taken the mean difference divided by the standard deviation of the differences (I'll call this $d$, though maybe I should call it something else?) and get $0.70$. There is also a very strong correlation between the two samples; I'm not sure whether that is problematic.

I would like to put confidence limits around my effect size estimate. To do so, I randomly resample from the difference scores, compute $d$ in the same way and repeat 1000 times. My question is whether this is a good approach, rather than, say, just giving confidence limits around the unstandardised difference or resampling from the original samples. My bootstrap gives me a mean $d$ of $0.79$ with confidence limits of $[0.4, 1.4]$. I've tried this on other random data too. Why am I getting a consistently higher $d$ from bootstrapping, and why are the intervals asymmetric? Is this because of skew in the (difference) scores, and does this make this approach more or less robust?
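
(For concreteness, a minimal base-R sketch of the resampling just described, using the difference scores from the data given in the edit below; the vector name `diffs` and the seed are only for illustration.)

set.seed(42)
diffs <- c(41, 100, 71, 1, 183, -39, -87, 57, 31, 207, 48, 14, 369, 39, 135)  # B - A differences
d_boot <- replicate(1000, {
  x <- sample(diffs, replace = TRUE)   # resample the difference scores with replacement
  mean(x) / sd(x)                      # recompute d on the resample
})
mean(d_boot)                           # mean of the bootstrap distribution of d
quantile(d_boot, c(0.025, 0.975))      # percentile 95% confidence limits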


Edit: here is an example of the data involved. 15 people were measured two times.

Mean A = 1742; SD = 435
Mean B = 1820; SD = 426
Mean difference = 78, SD of differences = 111, $d$ = 0.70
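In symbols, $d = \bar{D}/s_D = 78/111 \approx 0.70$, where $\bar{D}$ is the mean of the difference scores and $s_D$ their standard deviation.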

    A    B
 1999 2040
 1501 1601
 1552 1623
 2385 2386
 2488 2671
 1257 1218
 1806 1719
 1348 1405
 2048 2079
 1810 2017
 1308 1356
 2310 2324
 1247 1616
 1839 1878
 1235 1370
  • Just to say that I have found useful material on these pages (though I haven't got a specific answer to the case of bootstrapping CIs for an effect size) http://stats.stackexchange.com/questions/71525/critical-effect-sizes-and-power-for-paired-t-test?rq=1 , http://stats.stackexchange.com/questions/73818/reporting-operative-effect-in-paired-t-test?rq=1 – splint Nov 30 '15 at 15:45
  • I'm not quite sure I'm following this. Can you give a simple example / some example data? Is this a multiple comparisons issue? – gung - Reinstate Monica Dec 03 '15 at 14:59
  • @gung Thanks for looking. The t quoted is a simple example though I can fish out some data if you want. The issue is not about multiple comparisons. It is about (1) how to calculate effect sizes in a paired t-test; (2) whether it makes sense to bootstrap a confidence interval around this; and (3) why this interval might be asymmetric. – splint Dec 04 '15 at 14:16
  • What are the "repeated samples" that supposedly lead people to "do effect sizes in different ways"? For people here to get a sense of why the mean of your bootsamples is different & the CI is asymmetric, you will probably need to paste your data & your code. – gung - Reinstate Monica Dec 04 '15 at 16:51
  • I have added some data. Is there a better way to do tables on here? For background on the different ways to calculate effect sizes in repeated measures, see the links in my first comment (essentially, some prefer to use the pooled SD as a denominator rather than the SD of the difference scores). – splint Dec 04 '15 at 17:47
  • For what it's worth, I had this same theoretical issue some months ago (i.e. trying to determine the correct denominator for calculating the effect size for a paired t-test); from my research into the literature and from fiddling around with some simulated data, I found using the standard deviation of the difference scores as the denominator to be preferable to using the pooled standard deviation. At the moment I can't find the specific references I used, but I will try to find them and post links when I get the chance. – Ryan Simmons Dec 05 '15 at 16:34
  • @Ryan, using pooled standard deviation (if you mean pooled across two groups) does not make any sense at all. The variation inside each group can be huge, but all pairwise differences can be close to zero. One clearly needs to use the standard deviation of the differences. – amoeba Dec 05 '15 at 21:53
  • @amoeba ... which is exactly what I said? I said that it is preferable to use the standard deviation of the differences ... – Ryan Simmons Dec 07 '15 at 18:42
  • @Ryan, yes, I did not contradict what you said, I just thought that "preferable" is too weak a word to describe it :-) – amoeba Dec 07 '15 at 19:31
  • @amoeba Fair enough! I said "preferable" only because I couldn't remember the exact justification so didn't want to come on too strong. But thanks for the clarification! – Ryan Simmons Dec 07 '15 at 21:07

1 Answer


I will attempt to answer, but I am not totally sure of my own knowledge on this subject.

As far as I know, the bootstrap is always done on the original data. In your case the original data are pairs, so to bootstrap you would have to sample randomly (with replacement) from the pairs of the original data. That is equivalent to bootstrapping the difference scores and performing the effect size calculation you described on each resample.
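
(As a small illustration of that equivalence, a sketch in R; the toy data frame `pairs` just reuses the first five pairs from the question, and the point is that resampling rows of the pairs and then differencing gives exactly the same statistic as resampling the precomputed differences with the same row indices.)

set.seed(1)
pairs <- data.frame(A = c(1999, 1501, 1552, 2385, 2488),
                    B = c(2040, 1601, 1623, 2386, 2671))   # first five pairs from the question
diffs <- pairs$B - pairs$A
i <- sample(nrow(pairs), replace = TRUE)                    # one bootstrap draw of row indices
d_from_pairs <- mean(pairs$B[i] - pairs$A[i]) / sd(pairs$B[i] - pairs$A[i])
d_from_diffs <- mean(diffs[i]) / sd(diffs[i])
all.equal(d_from_pairs, d_from_diffs)                       # TRUE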

I get a different result from you (in R):

a=read.table(header=F,text="
1999 2040
1501 1601
1552 1623
2385 2386
2488 2671
1257 1218
1806 1719
1348 1405
2048 2079
1810 2017
1308 1356
2310 2324
1247 1616
1839 1878
1235 1370
")
d = a$V2 - a$V1                            # difference scores (B minus A)
mean(d)/sd(d)                              # effect size on the full sample
[1] 0.7006464
aux = function(x, i) mean(x[i])/sd(x[i])   # statistic for boot(): d computed on each resample
bb = boot::boot(d, aux, R = 1000)          # 1000 bootstrap replicates
mean(bb$t)                                 # mean of the bootstrap distribution
[1] 0.7530415
boot::boot.ci(bb)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot::boot.ci(boot.out = bb)

Intervals : 
Level      Normal              Basic         
95%   ( 0.1840,  1.0846 )   ( 0.1454,  1.0570 )      

Level     Percentile            BCa          
95%   ( 0.3443,  1.2559 )   ( 0.1634,  1.0722 )  
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable

(code corrected as per the comments)

Indeed, the direct calculation of the effect size (mean(d)/sd(d)) is not close to the bootstrap calculation (mean(bb$t)). I don't know how to explain it.

The only confidence interval that matches yours is the percentile one. (I don't really know which interval to choose on theoretical grounds; I use the BCa, which I think was suggested somewhere.)

The second way to calculate a CI on the effect size is to use analytical formulas. This question on CV discusses the formulas: How can I calculate the 95% confidence interval of an effect size if I have the mean difference score, CI of that difference score

Using the MBESS package I get the following CI

MBESS::ci.sm(Mean = mean(d), SD=sd(d),N=length(d))
[1] "The 0.95 confidence limits for the standardized mean are given as:"
$Lower.Conf.Limit.Standardized.Mean
[1] 0.1231584

$Standardized.Mean
[1] 0.7006464

$Upper.Conf.Limit.Standardized.Mean
[1] 1.258396

As for your suggestion of computing the confidence interval for the difference score and using it to compute a confidence interval on the effect size: I have never heard of it, and I would suggest not using it.

Jacques Wainer
  • +1 to @amoeba, I think you want to use `mean(bb$t)`. Nice answer; +1 as soon as you fix that issue. – usεr11852 Dec 06 '15 at 08:16
  • Very helpful answer. I do have questions though. `mean(bb$t)` returns 0.76 which, as in my example, is considerably greater than the sample value. All of the intervals are also asymmetric, whereas my understanding was that this should not be the case for analytically computed CIs. – splint Dec 06 '15 at 12:07
  • There is a missing close bracket in the function/mean call, apparently edits of 1 character are not allowed! – splint Dec 06 '15 at 12:10
  • Thanks folks. mean(bb$t) and the ")" in the mean are corrected. Indeed the values for the bootstrap mean and the full-data effect size are not close. I don't know how to explain it. – Jacques Wainer Dec 07 '15 at 13:44
  • @splint and Jacques: what happens here has to do with bias and bias correction in the bootstrap, see e.g. [on wikipedia](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Deriving_confidence_intervals_from_the_bootstrap_distribution). I am not a specialist, but roughly what happens is that the difference between your empirical value 0.7 and your bootstrapped value 0.75 indicates a bias. You can correct this bias by subtracting this difference from 0.7 and arrive at the bias-corrected estimate of d as 0.65. The intuition is that if your bootstrapped samples were on average 0.05 higher [ctd.] – amoeba Dec 07 '15 at 14:00
  • ... than the empirical value, then the empirical value is also likely to be higher than the population value. The different confidence intervals are various ways to compute a confidence interval about that bias-corrected value. I think that `BC` stands for bias-corrected, and note that that interval is approximately symmetric around 0.65. – amoeba Dec 07 '15 at 14:02
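
(Following up on amoeba's comment above, a minimal sketch of that bias correction, reusing the objects d and bb from the answer's code; the exact numbers will vary from run to run of the bootstrap.)

d_hat <- mean(d) / sd(d)        # empirical effect size, 0.70
bias  <- mean(bb$t) - d_hat     # difference between the bootstrap mean and the empirical value
d_hat - bias                    # bias-corrected estimate of d (about 0.65 in the run above)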