Should one always expect the central tendency (i.e., mean and/or median) of a bootstrap distribution to be similar to the observed value of the statistic?
In this particular case I have responses that are distributed exponentially for subjects across two conditions (I didn't run the experiment; I only have the data). I have been tasked with bootstrapping the effect size in terms of Cohen's d, using the one-sample formula $\frac{\bar{M}_D}{s_D}$, where $s_D$ is the sample estimate of the population standard deviation of the difference scores. The formula is given in Rosenthal & Rosnow (2008), p. 398, equation 13.27. They use $\sigma$ in the denominator because that is historically correct, but standard practice has misdefined d as using $s$, and so I follow through with that error in the calculation above.
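For concreteness, here is a minimal sketch of that calculation (the function name, and the assumption that I start from per-participant difference scores, are illustrative):

```python
import numpy as np

def cohens_d(diffs):
    """One-sample Cohen's d: the mean of the difference scores divided
    by their sample standard deviation (ddof=1 gives s rather than the
    population sigma of the original formula)."""
    diffs = np.asarray(diffs, dtype=float)
    return diffs.mean() / diffs.std(ddof=1)
```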
I have resampled both within participants (i.e., a participant's RTs may be drawn more than once) and across participants (a participant may be sampled more than once), such that even if participant 1 is sampled twice, their mean RT in the two draws is unlikely to be exactly equal. For each resampled dataset I recalculate d, with $N_{sim} = 10000$; a sketch of this loop appears below. What I'm observing is a tendency for the observed value of Cohen's d to lie closer to the 97.5th percentile of the simulated values than to the 2.5th percentile. It also tends to be closer to 0 than the median of the bootstrap distribution (by 5% to 10% of the density of the simulated distribution).
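Schematically, the resampling loop looks something like this (here `rts_by_subject`, mapping each participant to an array of per-trial RT differences, is an illustrative stand-in for my actual data structure):

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

def resample_d(rts_by_subject, n_sim=10_000):
    """Two-level resampling: draw participants with replacement, then
    draw each sampled participant's trials with replacement, so that a
    participant drawn twice contributes two (usually different) mean RTs.
    Returns the n_sim resampled values of d."""
    subjects = list(rts_by_subject)
    ds = np.empty(n_sim)
    for i in range(n_sim):
        # across-participant resampling
        sampled = rng.choice(subjects, size=len(subjects), replace=True)
        # within-participant resampling of trials, then per-subject means
        means = np.array([
            rng.choice(rts_by_subject[s], size=len(rts_by_subject[s]),
                       replace=True).mean()
            for s in sampled
        ])
        ds[i] = means.mean() / means.std(ddof=1)  # recompute d
    return ds
```

The percentiles I compare the observed d against are then just `np.percentile(ds, [2.5, 50, 97.5])`.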
What can account for this (keeping in mind the magnitude of the effect I'm observing)? Is it because, upon resampling, it is 'easier' to obtain variances more extreme than the observed one than it is to obtain comparably extreme means? Might this be a sign of data that have been overly massaged or selectively trimmed? Is this resampling approach the same as a bootstrap? If not, what else must be done to arrive at a confidence interval?