
When studying two independent sample means, we are told we are looking at the "difference of two means". This means we take the mean from population 1 ($\bar y_1$) and subtract from it the mean from population 2 ($\bar y_2$). So, our "difference of two means" is $\bar y_1 - \bar y_2$.

When studying paired sample means, we are told we are looking at the "mean difference", $\bar d$. This is calculated by taking the difference within each pair, and then taking the mean of all those differences.

My question is: do we get the same value for $\bar y_1 - \bar y_2$ and for $\bar d$ if we calculate them from the same two columns of data, the first time treating the columns as two independent samples and the second time treating them as paired data? I have played around with two columns of data, and it seems that the values are the same! In that case, can it be said that the different names are used for just non-quantitative reasons?

user84756

2 Answers


(I'm assuming you mean "sample" and not "population" in your first paragraph.)

The equivalence is easy to show mathematically. Start with two samples of equal size, $\{x_1,\dots,x_n\}$ and $\{y_1,\dots,y_n\}$. Then define $$\begin{align} \bar x &= \frac{1}{n} \sum_{i=1}^n x_i \\ \bar y &= \frac{1}{n} \sum_{i=1}^n y_i \\ \bar d &= \frac{1}{n} \sum_{i=1}^n \left( x_i - y_i \right) \end{align}$$

Then you have: $$\begin{align} \bar x - \bar y &= \left( \frac{1}{n} \sum_{i=1}^n x_i \right) - \left( \frac{1}{n} \sum_{i=1}^n y_i \right) \\ &= \frac{1}{n} \left( \sum_{i=1}^n x_i - \sum_{i=1}^n y_i \right) \\ &= \frac{1}{n} \left( \left( x_1 + \dots + x_n \right) - \left( y_1 + \dots + y_n \right) \right) \\ &= \frac{1}{n} \left( x_1 + \dots + x_n - y_1 - \dots - y_n \right) \\ &= \frac{1}{n} \left( x_1 - y_1 + \dots + x_n - y_n \right) \\ &= \frac{1}{n} \left( \left( x_1 - y_1 \right) + \dots + \left( x_n - y_n \right) \right) \\ &= \frac{1}{n} \sum_{i = 1}^n \left( x_i - y_i \right) \\ &= \bar d. \end{align}$$
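If you'd rather check this numerically, here is a minimal sketch using only the Python standard library. The two columns are made-up values for illustration, not data from the question. It also shuffles one column to show that the point estimate does not depend on how the rows are matched up, since the sums do not care about order.

```python
import math
import random
from statistics import mean

# Hypothetical measurements (illustrative values only).
x = [5.1, 4.8, 6.0, 5.5, 4.9]
y = [4.7, 4.9, 5.2, 5.0, 4.1]

# "Difference of two means": treat the columns as independent samples.
diff_of_means = mean(x) - mean(y)

# "Mean difference": treat the columns as paired data.
mean_of_diffs = mean([xi - yi for xi, yi in zip(x, y)])

# Re-pair the rows at random: the individual differences change,
# but their mean does not.
random.seed(0)
y_shuffled = random.sample(y, k=len(y))
mean_of_diffs_shuffled = mean([xi - yi for xi, yi in zip(x, y_shuffled)])

print(math.isclose(diff_of_means, mean_of_diffs))
print(math.isclose(diff_of_means, mean_of_diffs_shuffled))
```

Both comparisons should agree up to floating-point rounding. Note that this only concerns the point estimate; it says nothing about standard errors or confidence intervals.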

shadowtalker
  • But two confidence intervals calculated for "the difference of the means" and "the mean difference" will be different, right? This can be seen by looking at $A = [1, 2, 3, 4, 5, ...]$ and $B = [..., 5, 4, 3, 2, 1]$. A paired "mean difference" will be different for $A - A$ (which is all zero) versus $A - B$ (which is not all zero); the difference of the means is unaffected by the order of the elements. – bers Dec 15 '15 at 19:37
  • Can't edit my previous post any longer. The 3rd sentence should begin "A sequence of paired 'mean differences' ..." – bers Dec 16 '15 at 13:07
  • @bers what does $A-A$ have to do with it? – shadowtalker Dec 16 '15 at 13:23
  • Assume $C=A$. Then $A-C$ and $A-B$ are two different sequences. The confidence interval for the mean paired difference will certainly be different in both cases. But the difference of the means, and so its confidence interval, will be identical both for $A-C$ and $A-B$. Or am I wrong? – bers Dec 16 '15 at 13:27
  • @bers I think you're confused, but I'm confused as to what you're confused about. – shadowtalker Dec 16 '15 at 13:40
  • Which mean paired difference is "the" mean paired difference you're talking about? – shadowtalker Dec 16 '15 at 13:45
  • I posted a follow-up question to clear the confusion: https://stats.stackexchange.com/questions/187067/difference-in-means-vs-mean-difference-confidence-intervals – bers Dec 16 '15 at 15:09

The distribution of the mean difference should be tighter than the distribution of the difference of means. You can see this with an easy example. Sample 1: 1, 10, 100, 1000. Sample 2: 2, 11, 102, 1000. The pairwise differences are 1, 1, 2, 0, which (unlike the samples themselves) have a small standard deviation.
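To put numbers on this, here is a rough sketch with the two samples above, standard library only. It relies on the identity $\operatorname{var}(X - Y) = \operatorname{var}(X) + \operatorname{var}(Y) - 2\operatorname{cov}(X, Y)$. The helper `sample_cov` is my own name, and the unpooled formula used for the independent-samples standard error is one common choice that I am assuming here, not the only one.

```python
from math import sqrt
from statistics import mean, variance

x = [1, 10, 100, 1000]   # sample 1
y = [2, 11, 102, 1000]   # sample 2
n = len(x)

# Pairwise differences (sample 2 minus sample 1).
d = [yi - xi for xi, yi in zip(x, y)]

def sample_cov(a, b):
    """Sample covariance with the n - 1 denominator (hypothetical helper)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

# Both sides of var(X - Y) = var(X) + var(Y) - 2 cov(X, Y).
var_d = variance(d)
identity_rhs = variance(x) + variance(y) - 2 * sample_cov(x, y)

# Standard error of the mean difference (paired analysis).
se_paired = sqrt(var_d / n)

# Standard error of the difference of means (independent, unpooled analysis),
# which drops the covariance term.
se_independent = sqrt(variance(x) / n + variance(y) / n)

print("pairwise differences:", d)
print("var(d):", var_d, "identity right-hand side:", identity_rhs)
print("SE paired:", se_paired)
print("SE independent:", se_independent)
```

The point estimate is the same under either analysis; only the standard error differs. The gap is large here because the two columns are very strongly positively correlated, so the covariance term cancels almost all of the variance. With uncorrelated columns the two standard errors would be close to each other.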

Vlad
  • This is only the case because your two example vectors are ridiculously correlated and var(X - Y) = var(X) + var(Y) - 2cov(X, Y). – einar Oct 20 '20 at 09:24