8

Consider $n\geq 2$ i.i.d. normally distributed variables $s_i\sim\mathcal{N}\left(0,\sigma^2\right)$, with $i\in\left\{1,2,\dots,n\right\}$.

I draw two samples of $k<n$ variables each, without replacement. The set corresponding to the first (second) draw is denoted by $\Phi_1$ ($\Phi_2$).

What is the expectation of: $$\left| \frac{1}{k}\sum_{i\in\Phi_1} s_i - \frac{1}{k}\sum_{i\in\Phi_2} s_i \right|,$$ where $\left|x\right|$ is the absolute value of $x$?

  • 2
    Your setup is sufficiently complicated that we ought to suspect you have in mind the possibility that the $s_i$ are not independent; for if they were, you wouldn't need any more than one variable to describe this situation. We also should suspect you want the $\sigma^2$ to vary with $i,$ for if not, all it does is multiply the answer by $|\sigma|$ and you might as well assume $\sigma^2=1.$ Could you explain? – whuber Jun 17 '20 at 19:06
  • I want $s_i$ to be independent (the averages might contain overlapping variables and are therefore generally not independent). I agree $\sigma$ can be normalized. Thanks! – Marius Andrei Zoican Jun 17 '20 at 19:08
  • 2
    In that case the quantity between absolute value signs has a $\mathcal{N}(0,2\sigma^2/k)$ distribution, greatly simplifying your question. – whuber Jun 17 '20 at 19:11
  • Is that generally true? Elements in $\Phi_1$ and $\Phi_2$ might overlap. At the limit, if $k=n$ then both sets contain the same elements and the value is zero. My intuition is that the expression should decrease in $\frac{k}{n}$. – Marius Andrei Zoican Jun 17 '20 at 19:17
  • 2
    That explains the complication, then: you aren't taking samples of numbers (as one would understand your title), but literally are sampling the variables, as you state in the question. That clears up my misunderstanding, thank you. But since you're sampling with replacement, why limit $k$ to values less than $n$? (Indeed, why not even allow $n=1$?) – whuber Jun 17 '20 at 19:26
  • The background for the question is that I'm measuring disagreement between economic agents who can process $k$ signals from a universe of $n$. They might have some common signals, but not necessarily every one of them. In the limit if $k=n$ there is no disagreement. – Marius Andrei Zoican Jun 17 '20 at 19:31
  • 1
    Did you perhaps mean to sample *without* replacement, then? I'm unsure, because the relationships among agents, disagreement, and the signals are unclear in your (necessarily) brief description. – whuber Jun 17 '20 at 19:34
  • 1
    Correct! I was thinking that sampling is with replacement between the samples, but without replacement inside a sample. Thank you! – Marius Andrei Zoican Jun 17 '20 at 19:40
  • I have started this way. The probability that an element from the second sample is already in the first is $\dfrac{k}{n}$. If $p$ elements overlap between the two samples (and $k-p$ wash out), then the difference is distributed as $\mathcal{N}\left(0, 2\sigma^2\frac{k-p}{k^2}\right)$. The expectation of the absolute value is therefore $2\frac{\sigma}{k}\sqrt{k-p}$. The next step is to consider different values for the overlap $p$ and sum over them: $$ \frac{2\sigma}{k} \sum_{p=0}^k \binom{k}{p} \left(\frac{k}{n}\right)^p \left(1-\frac{k}{n}\right)^{k-p} \sqrt{k-p} $$. Closed form? – Marius Andrei Zoican Jun 17 '20 at 20:04
  • Does "sampling without replacement" apply only within a sample, or across both samples? For instance, could there be some variable that is present in both samples? – D.W. Jun 18 '20 at 07:07

3 Answers

9

Let's take $\sigma=1$ and ignore the division by $k;$ these simplifications will require us to multiply the answer by $|\sigma|/k$ (which I leave up to you). Thus we seek the expectation of $\left|Z(n,k)\right| $ where

$$Z(n,k) = \sum_{i\in\Phi_1} s_i - \sum_{i\in\Phi_2}s_i.$$

Because $-s_i$ and $s_i$ have the same distribution, the expression inside the absolute value has the same distribution as

$$\sum_{i\in\Phi_1\oplus\Phi_2}s_i$$

(writing $\Phi_1\oplus\Phi_2$ for the symmetric difference $\Phi_1\cup \Phi_2 \setminus \left(\Phi_1\cap\Phi_2\right)$), because the values in the intersection $\Phi_1\cap\Phi_2$ cancel out in the definition of $Z(n,k).$

Conditional on $(\Phi_1,\Phi_2),$ since $Z$ is the sum of independent Normal variables, its distribution is Normal with mean $0$ and variance $2(k-j)$ where $j$ is the cardinality of $\Phi_1\cap\Phi_2.$ (Notice that the component for $j=k$ is singular: it is an atom at $0.$)
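As a quick sanity check of this conditional distribution, here is a small R simulation with fixed index sets chosen for illustration (the specific sets are my own, not part of the original argument):

```r
# Conditional on fixed Phi1, Phi2 with overlap j = 1, Z should be N(0, 2(k - j)).
set.seed(1)
n <- 5; k <- 3
Phi1 <- c(1, 2, 3); Phi2 <- c(3, 4, 5)            # |Phi1 ∩ Phi2| = j = 1
Z <- replicate(1e5, {s <- rnorm(n); sum(s[Phi1]) - sum(s[Phi2])})
var(Z)                                            # close to 2 * (k - 1) = 4
```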

Consequently, the distribution of $Z$ is a mixture of these Normal distributions. The weights in the mixture are the chances of $j$ given by the hypergeometric distribution

$$\Pr(|\Phi_1\cap\Phi_2|=j) = \frac{\binom{k}{j}\binom{n-k}{k-j}}{\binom{n}{k}} =: p_{n,k}(j).$$
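(In R these weights are exactly the hypergeometric pmf `dhyper(j, k, n - k, k)`; a quick check for $n=5,$ $k=3$:)

```r
# p_{n,k}(j) coincides with dhyper(j, k, n - k, k): of the k items drawn for
# Phi2, j fall among the k items of Phi1 and k - j among the other n - k.
n <- 5; k <- 3; j <- 0:k
w <- choose(k, j) * choose(n - k, k - j) / choose(n, k)
stopifnot(all.equal(w, dhyper(j, k, n - k, k)), all.equal(sum(w), 1))
```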

The distribution of $|Z(n,k)|$ thus is a mixture of variables $Z_j(k),$ $j=0, 1, \ldots, k,$ that are $\sqrt{2(k-j)}$ times (independent copies of) $\chi(1)$ variables. Its expectation therefore is

$$E\left[\left|Z(n,k)\right|\right] = \sum_{j=0}^k p_{n,k}(j) \sqrt{2(k-j)} \sqrt{2/\pi} = \frac{2}{\sqrt{\pi}} \sum_{j=0}^k \sqrt{k-j}\, p_{n,k}(j).$$

As a test, we may simulate many values of $Z(n,k)$ directly from either of the first two formulas and compare their distribution to the mixture. Here, for instance, is the cumulative distribution of $5000$ simulated values on which the mixture CDF is overplotted in red:

Figure 1

The agreement is excellent.

Finally, with the formula for the expected absolute value available, we may plot $E\left[\left|Z(n,k)\right|\right]$ for $k=0, 1, \ldots, n.$ Here is a plot for larger $n:$

Figure 2


Remarks

This analysis readily extends to the case where $\Phi_1$ and $\Phi_2$ are of different sizes $k_1$ and $k_2:$ replace $2(k-j) = \left|\Phi_1\oplus\Phi_2\right|$ by $(k_1-j)+(k_2-j)$ at the outset and use

$$p_{n;k_1,k_2}(j)=\Pr\left(\left|\Phi_1\cap\Phi_2\right| = j\right) = \frac{\binom{k_1}{j}\binom{n-k_1}{k_2-j}}{\binom{n}{k_2}}$$

for the mixture weights, taking the sum over all $j$ for which the binomial coefficients are nonzero.
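A sketch of this extension in R (the helper `eZ2` is mine, not from the original post), using $E\left|\mathcal{N}(0,v)\right| = \sqrt{2v/\pi}$:

```r
# E|Z| for samples of sizes k1 and k2: mixture over the overlap j, with
# conditional variance (k1 - j) + (k2 - j).
eZ2 <- function(n, k1, k2) {
  j <- max(0, k1 + k2 - n):min(k1, k2)            # overlaps with nonzero weight
  p <- exp(lchoose(k1, j) + lchoose(n - k1, k2 - j) - lchoose(n, k2))
  sqrt(2 / pi) * sum(sqrt((k1 - j) + (k2 - j)) * p)
}
```

For $k_1=k_2=k$ this reduces to the expectation formula above.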

The atom (discrete component) in the distribution of $Z$ occurs only when $k_1=k_2=k.$ Its weight is the chance of complete cancellation where $\Phi_1=\Phi_2,$ given by $$p_{n,k}(k) = 1/\binom{n}{k}.$$ In the figure (showing the CDF), this is the height of the vertical jump at $Z=0,$ there equal to $1/\binom{5}{3}=1/10.$
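That weight is easy to confirm by simulation for $n=5,$ $k=3$ (a sketch, using the same sampling scheme as in the Appendix):

```r
# Estimate Pr(Phi1 = Phi2), the weight of the atom at Z = 0.
set.seed(17)
n <- 5; k <- 3
same <- replicate(1e5, identical(sort(sample.int(n, k)),
                                 sort(sample.int(n, k))))
mean(same)                                        # near 1/choose(5, 3) = 0.1
```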

We could even go so far as to choose fixed coefficient vectors $\alpha_i$ and $\beta_i,$ let the $s_i$ have an arbitrary distribution (with possibly nonzero mean), and consider

$$Z(n,k;\alpha,\beta) = \sum_{i\in\Phi_1}\alpha_i s_i + \sum_{i\in\Phi_2}\beta_i s_i.$$

The question concerns the case $\alpha_i=1/k$ and $\beta_i=-1/k$ for all $i.$ The preliminary simplification of factoring out the common factor of $1/k$ is no longer available, but the analysis doesn't essentially change: the strategy of conditioning on $(\Phi_1,\Phi_2)$ and breaking the union of the samples into $\Phi_1\setminus\Phi_2,$ $\Phi_2\setminus\Phi_1,$ and $\Phi_1\cap\Phi_2$ still works. I leave the algebraic complications to the interested reader.


Appendix

Here is R code for the simulation in the first figure:

n <- 5
k <- 3
#
# Random draws of Z
#
set.seed(17)
Z <- replicate(5e3, {
  x <- rnorm(n)
  i1 <- sample.int(n, k)
  i2 <- sample.int(n, k)
  sum(x[i1]) - sum(x[i2])                          # Original formula
  # sum(x[setdiff(union(i1,i2), intersect(i1,i2))])# Second formula
})
#
# CDF of Z
#
pf <- function(x, n, k) {
  lp <- function(j) lchoose(k,j) + lchoose(n-k,k-j) - lchoose(n,k)
  z <- sapply(0:k, function(j) exp(lp(j) + pnorm(x, 0, sqrt(2*(k-j)), log=TRUE)))
  rowSums(matrix(z, ncol=k+1))
}
#
# Plots
#
plot(ecdf(Z), main=paste0("Simulated values of Z(",n,",",k,")"),
     cex.main=1, xlab="Z", ylab="Probability")
curve(pf(x, n, k), xlim=c(min(Z), -1e-15), add=TRUE, col="Red", lwd=2, n=1001)
curve(pf(x, n, k), xlim=c(1e-15, max(Z)), add=TRUE, col="Red", lwd=2, n=1001)

Here is R code for the second figure, showing the direct calculation of the expectation:

eZ <- Vectorize(function(n, k) {
  p <- function(j) exp(lchoose(k,j) + lchoose(n-k,k-j) - lchoose(n,k))
  j <- 0:k
  2 / sqrt(pi) * sum(sqrt(k-j) * p(j))
}, "k")

n <- 25
plot(0:n, eZ(n, 0:n), type="h", ylab="Value",
     main=expression(E*group("[", list(italic(Z)(25,k)), "]")), cex.main=1,
     bty="n", xlab=expression(italic(k)))
whuber
  • 2
    I suspect the author might mean that sampling is without replacement within each individual sample, but the two samples need not be disjoint. If so, where you write $i \in \Phi_1 \cup \Phi_2$, I suspect that might need to be $i \in \Phi_1 \cup \Phi_2 \setminus (\Phi_1 \cap \Phi_2)$. If $i \in \Phi_1 \cap \Phi_2$, then you add $s_i$ in the first sum and subtract $s_i$ in the second sum, and those two cancel out. Or maybe "without replacement" spans both samples, not just each sample separately? – D.W. Jun 18 '20 at 07:08
  • In the context of the rest of the answer I think your first suggestion (i.e. the set typo) is correct. – rwolst Jun 18 '20 at 10:19
  • @D.W. Thank you for the comment. In fact, upon re-reading this post the omission of the "$\setminus \Phi_1\cap\Phi_2$" term was immediately obvious and I have corrected it. (The commented-out line in the code correctly implements the symmetric difference of the samples and, when run, produces equivalent results.) – whuber Jun 18 '20 at 13:19
2

Suppose $n = 100, k = 80.$ Then it makes a difference whether sampling is with or without replacement.

set.seed(2020)
x = rnorm(100, 50, 8)
a = mean(x);  a
[1] 50.87113
sd(x);  sd(x)/sqrt(100)
[1] 8.954334   
[1] 0.8954334  # aprx SE mean

The population SD is $\sigma = 8.$ The reference sample of 100 has $S = 8.954,$ so the SE mean estimated from the reference sample is $S/\sqrt{n} = 0.8954.$

a.wo = replicate(10^5, mean(sample(x,80)) )
sd(a.wo)
[1] 0.4467356  # aprx SE mean w/o replacement
a.wr = replicate(10^5, mean(sample(x,80, rep=T)) )
sd(a.wr)
[1] 0.99378    # aprx SE mean with replacement

Means of subsamples taken without replacement are less variable than means of subsamples taken with replacement; as the subsample exhausts the available pool, the variability shrinks toward zero. Also, means of subsamples taken with replacement get more variable as the subsample size decreases (as for $k=50$ below).

a.wr.50 = replicate(10^5, mean(sample(x,50, rep=T)) )
sd(a.wr.50)
[1] 1.262685

Now for a second vector of $100\,000$ such averages of subsamples of size $k=80.$

a.wr2 = replicate(10^5, mean(sample(x,80,rep=T)))
sd(a.wr2)
[1] 0.9945862
mean(abs(a.wr - a.wr2))
[1] 1.121448

As I interpret your question, the last result above approximates the answer to your question for $n = 100, k = 80$ and sampling with replacement for two independent samples.

If that is correct, it seems worthwhile to try to get an analytic solution for $Var(\frac{1}{k}\sum_i X_i)$ and from there the variance of the absolute difference of two such averages.
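One standard route to that variance (my suggestion, not derived in this answer) is the finite-population correction: the mean of $k$ values drawn without replacement from the fixed pool $x_1,\dots,x_n$ has variance $\frac{s^2}{k}\cdot\frac{n-k}{n-1},$ where $s^2$ is the population variance of the pool. A sketch:

```r
# Variance of the mean of k values drawn without replacement from a fixed
# pool x, via the finite-population correction.
var_mean_wo <- function(x, k) {
  n <- length(x)
  s2 <- mean((x - mean(x))^2)                     # population variance of the pool
  s2 / k * (n - k) / (n - 1)
}
```

For the $x$ above ($n=100,$ $k=80$), `sqrt(var_mean_wo(x, 80))` should land close to the simulated 0.4467.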

BruceET
0

I have started this way: The probability that an element from the second sample is already in the first is $\dfrac{k}{n}$.

If $p$ elements overlap between the two samples (and consequently $k-p$ wash out), then the difference is distributed as $\mathcal{N}\left(0,2\frac{\sigma^2}{k^2}\left(k-p\right)\right)$. The expectation of the absolute value is therefore $\frac{2\sigma}{k}\sqrt{\frac{k-p}{\pi}}$.

The next step is to take the expectation over different overlap levels $p$: $$\frac{2\sigma}{k\sqrt{\pi}} \sum_{p=0}^k \binom{k}{p} \left(\frac{k}{n}\right)^p \left(1-\frac{k}{n}\right)^{k-p} \sqrt{k-p}.$$

Does this have a closed form?
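As a quick numerical check of the overlap weights (illustrative values $n=5,$ $k=3$), one can compare the binomial weights above with the exact hypergeometric probabilities of the overlap:

```r
# Binomial weights used in the sum above vs. the exact overlap pmf.
n <- 5; k <- 3; p <- 0:k
binom <- dbinom(p, k, k / n)                      # weights in the sum above
hyper <- dhyper(p, k, n - k, k)                   # exact Pr(overlap = p)
rbind(binom, hyper)                               # the two differ
```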

  • This is indeed a good start. I was in the process of posting a complete solution when this appeared. No, the sum (once it is fixed to reflect the correct probabilities of $p$) has no closed form in general--the square root precludes that. – whuber Jun 17 '20 at 21:09