
I came across these two old blog posts on displayed error bars and tried to work through the result. I believe I am making a mistake somewhere, but I'm not sure where. Let me describe the scenario first, and lay out my reasoning.

First, the scenario:

Suppose that we have a plot with the measurements of a particular quantity $x$ for two different populations, $A$ and $B$. Let us assume that $x$ is Gaussian distributed.

We find that the two measurements have means $\bar{x}_1$, $\bar{x}_2$.

We make a plot of both measurements with their $2\sigma$ confidence limits. For simplicity, let us say that both datasets have the same standard deviation $s$.

The confidence limits overlap to some extent. The posts ask: by how much can they overlap while the difference is still significant at the $\alpha = 0.05$ level?

My attempt at answering this

Let us construct the statistic $z = \frac{\bar{x}_1 - \bar{x}_2}{2 s}$.

The standard deviation of $z$ is then $\sigma_z = \sqrt{\left(\frac{\partial z}{\partial \bar{x}_1}\right)^2 s_1^2 + \left(\frac{\partial z}{\partial \bar{x}_2}\right)^2 s_2^2} \quad = \frac{1}{\sqrt{2}}$, since $s_1 = s_2 = s$.
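Spelling out the middle step, using $\partial z/\partial \bar{x}_1 = -\,\partial z/\partial \bar{x}_2 = \frac{1}{2s}$ and $s_1 = s_2 = s$:

$\sigma_z = \sqrt{2 \left(\frac{1}{2s}\right)^2 s^2} = \frac{1}{\sqrt{2}}.$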

Now, we can rephrase our problem as the search for a value $z_{\star}$ such that $p(-z_{\star} \leq z \leq z_{\star} )= 0.95$, given that $z \sim \mathcal{N}(\mu_z = 0, \sigma_z = \frac{1}{\sqrt{2}})$ -- i.e. the null hypothesis is that $z$ is normally-distributed about a mean of $0$ and standard deviation $1/\sqrt{2}$.

Going to Mathematica, I find that $z_{\star} \approx 1.386$.
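In case it is useful, here is the same number reproduced in Python/SciPy rather than Mathematica (nothing more than the calculation described above):

```python
# Two-sided 95% critical value for z ~ N(0, 1/sqrt(2)) under the null.
from scipy.stats import norm

sigma_z = 1 / 2**0.5                             # sigma_z = 1/sqrt(2)
z_star = norm.ppf(0.975, loc=0, scale=sigma_z)   # P(-z* <= z <= z*) = 0.95
print(z_star)                                    # ~1.386, i.e. 1.96/sqrt(2)
```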

To try to interpret what this means, let us now re-write $z = \frac{\bar{x}_1 - \bar{x}_2}{2s} = \frac{\Delta \bar{x}}{w}$, where $w$ is the length of the $2\sigma$ "error bars".

Reaching statistical significance when $|z| > z_{\star} \approx 1.386$ implies that $|\Delta \bar{x}|$ only needs to exceed $1.386 \; w$ -- well below the $2w$ at which the bars stop overlapping -- so the "error bars" can overlap substantially and still show a significant difference.

This seems at odds with the statement here that the error bars "can overlap by as much as 25% of their total length and still show a significant difference."

So: where is the gap in my reasoning? (Is it in the interpretation of the standard deviation/standard error in the $t$-test?)

(Btw, I don't think the definition of 95% CLs in these posts is technically correct, with the usual mixing up of Bayesian and Frequentist interpretations. I've tried to avoid this in my question, but let me know if I can be clearer.)

nonreligious
  • What are you doing with that partial derivative? – Dave Aug 05 '21 at 16:42
  • @Dave ah, I meant derivatives wrt $\bar{x}_1$, $\bar{x}_2$. Hope that's now correct! – nonreligious Aug 05 '21 at 16:51
  • There are several questions stated here, but the first--and seemingly the primary one, "is this result significant at the α=0.05 level,"--is answered in the duplicate thread. – whuber Aug 05 '21 at 18:03
  • @whuber I appreciate the sentiment, but I think my primary question is essentially "where is this reasoning going wrong" rather than what you say. I suspect it is something to do with the interpretation of the standard deviation as the SE on the mean, but I am not sure – nonreligious Aug 05 '21 at 21:27
  • Would you mind editing the post so that it states a single clear question? That will help prevent the appearance of multiple different answers once it is reopened. – whuber Aug 05 '21 at 21:33
  • @whuber Done, but if it isn't as clear as you'd like, feel free to edit it further before reopening. – nonreligious Aug 05 '21 at 22:17
  • If it's not clear, then--being unable to read your mind (at this distance ;-)--I would be in no position to edit it! There, however, seems to be no contradiction: when someone makes a claim about "as much as 25%," they are not ruling out larger values. – whuber Aug 05 '21 at 22:20
  • You use the term "confidence limits" to mean plus or minus two standard deviations. This is not the usual usage. Confidence limits are about how precisely you have determined the sample mean, so are based on the standard error of the mean (plus or minus two SEMs is a good approximation if sample size is reasonably large). The SEM will get smaller (so the CI will get narrower) as sample size goes up. Your definition based on SD won't be affected by increasing sample size. – Harvey Motulsky Aug 13 '21 at 21:37
  • @HarveyMotulsky Yes, I realize I was sloppy with my language here, and perhaps this was part of my confusion. I tried to reformulate this question as an answer below. – nonreligious Aug 13 '21 at 22:14

1 Answer


Had another go at this, corrected some of the framing of my query and got to the point where there can be up to $29\%$ overlap (to be defined below) between the confidence limits.

1. Re-framing the problem:

We have two groups, $1$ and $2$, and we measure some quantity $Q$ for samples of size $n_1$ and $n_2$ from $1$ and $2$ respectively.

Let us assume that the $Q$ measurements for populations $1$ and $2$ are Gaussian distributed with (population) means $\mu_i$ and variances $\sigma_i^2$, for $i=1,2$.

We measure $Q$ for our samples and find mean values $\bar{x}_1$, $\bar{x}_2$ and Bessel-corrected sample variances $s_1^2$, $s_2^2$.

Following standard procedure, we calculate the $95\%$ confidence intervals for the means $\mu_i$, $[\bar{x}_i - w_i, \bar{x}_i + w_i]$.

Here is the question: by how much can these confidence intervals overlap, and still yield a statistically significant difference for the population means at the $\alpha=0.05$ level? (Apologies if there is a factor of two difference from your definition.)

2. Assumptions for tractability:

From this answer, it seems that the general case of this problem is related to the so-called Behrens-Fisher problem. I just want some heuristic feeling for what's happening, so let's assume:

  • $n_1 \approx n_2 = n$ : the sample sizes are approximately the same
  • $n \gg 1$ : the sample sizes are quite large (let me say at least $\mathcal{O}(10)$).
  • $s_1 \approx s_2 = s$ : the sample standard deviations are approximately the same

3. Definition of overlap:

With the previous assumptions, and some foreknowledge that the (half-)widths are related to the sample standard deviations $s_i$ (which we have taken to be approximately equal), we can take the confidence intervals to have the same width, i.e. $[\bar{x}_i - w, \bar{x}_i + w]$.

Let us define the overlap as $r = 1 - \dfrac{|\bar{x}_1 - \bar{x}_2|}{2w}$ when $|\bar{x}_1 - \bar{x}_2| \leq 2w$, and $0$ otherwise.

Then $r$ is just the ratio of the common overlap range to the entire width $2w$.
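For concreteness, a minimal sketch of this definition in Python (the function name and the example numbers are just illustrative):

```python
def overlap_ratio(xbar1, xbar2, w):
    # r = 1 - |xbar1 - xbar2| / (2w) when the intervals [xbar_i - w, xbar_i + w]
    # overlap, and 0 otherwise.
    delta = abs(xbar1 - xbar2)
    return max(0.0, 1.0 - delta / (2 * w))

# Example: means one half-width apart, so the intervals share 75% of their length.
print(overlap_ratio(0.0, 1.0, w=2.0))   # 0.75
```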

4. Relation of the width $w$ to $s$

(This should be boilerplate stuff, so I'll be brief.) For each population $i$, we define the $t$-statistic

$t_i = \dfrac{\bar{x}_i - \mu_i}{s_i/\sqrt{n_i}} \approx \dfrac{\bar{x}_i - \mu_i}{s/\sqrt{n}}$, with $\nu_i$ degrees of freedom, $\nu_i = n_i - 1 \approx n - 1$.

We want the critical value $t_{\alpha}$ such that $\mathrm{Pr}(-t_{\alpha} < t_i < t_{\alpha}) = 0.95$.

With the assumption that $n$ and hence $\nu_i$ are large, we find that $t_\alpha \approx 2$ (more like 1.96, but let's keep that in our pocket).

From the definition of $t_i$, we find that the $95\%$ CIs are then (approximately)

$[\bar{x}_i - 2 s/\sqrt{n}, \bar{x}_i + 2 s/\sqrt{n}]$, and comparing with our usage of $w$ above, we see that $w = 2s/\sqrt{n}$.
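As a quick sanity check on $t_\alpha \approx 2$ and $w \approx 2s/\sqrt{n}$, here is a short SciPy sketch (the values of $n$ and $s$ are made up):

```python
# 95% CI half-width for a single sample mean, w = t_alpha * s / sqrt(n).
import numpy as np
from scipy.stats import t

n, s = 30, 1.0                      # hypothetical sample size and sample SD
t_alpha = t.ppf(0.975, df=n - 1)    # two-sided 95% critical value, nu = n - 1
w = t_alpha * s / np.sqrt(n)        # half-width of the 95% CI for the mean
print(t_alpha, w)                   # ~2.05 and ~0.37, close to 2 and 2s/sqrt(n) ~ 0.365
```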

5. Statistical significance of the difference

Let us define the difference $\delta = \bar{x}_1 - \bar{x}_2$. Our null hypothesis is that $\delta \sim \mathcal{N}(\mu_{\delta}, \sigma_{\delta})$, with $\mu_{\delta}=0$.

The variance of $\delta$ is $\sigma_{\delta}^2 = \dfrac{\sigma_{1}^2}{n_1} + \dfrac{\sigma_2^2}{n_2} \approx \dfrac{2s^2}{n}$.

Then we have the $t$-statistic $t_{\delta}= \dfrac{\delta}{\sqrt{2s^2/n}}$, with $\nu_{\delta} = (n_1 - 1) + (n_2 -1) \approx 2 (n-1)$ degrees of freedom.

Again, assuming $n$ is large, we can find the critical value $t_{\delta,\alpha} \approx 2$ such that $\mathrm{Pr}(-t_{\delta,\alpha} < t_{\delta} < t_{\delta,\alpha}) = 0.95$, and then invert this to find that the $95\%$ acceptance region for $\delta$ under $\mu_{\delta}=0$ is $[ -2\sqrt{2s^2/n} ,\; 2\sqrt{2s^2/n}]$.
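The analogous sketch for the difference, with the same made-up $n$ and $s$ (again only a numerical illustration of the formulas above):

```python
# Smallest |delta| = |xbar1 - xbar2| that is significant at alpha = 0.05.
import numpy as np
from scipy.stats import t

n, s = 30, 1.0
se_delta = np.sqrt(2 * s**2 / n)               # standard error of delta
t_delta_alpha = t.ppf(0.975, df=2 * (n - 1))   # critical value, nu = 2(n - 1)
crit_diff = t_delta_alpha * se_delta           # threshold for significance
print(t_delta_alpha, crit_diff)                # ~2.00 and ~0.52 ~ 2*sqrt(2 s^2/n)
```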

6. Back to the overlap

If I'm not making some common fallacy here, the above implies that finding $ |\bar{x}_1 - \bar{x}_2| = | \delta | > 2\sqrt{2s^2/n}$ would be statistically significant at the $\alpha=0.05$ level.

In 4., we saw that $w = 2s/\sqrt{n}$, and using the definition of the overlap ratio in 3., the above relation then implies

$2w(1-r) = |\bar{x}_1 - \bar{x}_2| > w \sqrt{2}$, which we can rearrange to find that, if the overlap ratio

$r < 1 - \dfrac{1}{\sqrt{2}} \approx 0.29$,

then the difference between the groups is statistically significant.
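As a cross-check of the $\approx 29\%$ figure, one can keep the exact finite-sample $t$ critical values instead of the $\approx 2$ approximation (the sample size below is arbitrary):

```python
# Critical overlap ratio r_crit = 1 - t_{delta,alpha} / (sqrt(2) * t_alpha).
import numpy as np
from scipy.stats import t

n = 30                                         # hypothetical common sample size
t_alpha = t.ppf(0.975, df=n - 1)               # critical value for each CI
t_delta_alpha = t.ppf(0.975, df=2 * (n - 1))   # critical value for the difference
r_crit = 1 - t_delta_alpha / (np.sqrt(2) * t_alpha)
print(r_crit)    # ~0.31 for n = 30, tending to 1 - 1/sqrt(2) ~ 0.29 for large n
```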

I'd be happy to accept that e.g. the difference between $1.96$ and $2$ in the determination of the critical values of the $t$-distribution may be the difference between this $29\%$ value and the $25\%$ claim, but also happy for any criticism.

nonreligious