
Consider a set of distinct numbers. After removing both the maximum and the minimum from the set and adding the median of the original set, the numbers intuitively become less dispersed, so the variance should decrease. How can this result be proved formally?

I have attempted to work directly from the definition, but the expansion becomes a mess; it does not appear to be a feasible approach at all.


Thank you very much for @whuber♦'s detailed explanation. But I actually meant to include the median once only, so that we are moving from a set of $n$ numbers to a set of $n-1$ numbers. I attempted to follow your argument and considered $X=\{x_2,\ldots,x_{n-1}\}$, $Y=\{x_1,x_n\}$ and $Y'=\{x_M\}$, where $\mu_1(X,Y)$ is assumed to be $0$. Then I obtained

$$\begin{align}
\Delta_X(Y,Y') &= \operatorname{Var}(X,Y)-\operatorname{Var}(X,Y') \\
&= \left[\mu_2(X,Y)-\mu_1(X,Y)^2\right]-\left[\mu_2(X,Y')-\mu_1(X,Y')^2\right] \\
&= \left[\frac{\sum_{i=1}^{n}x_i^2}{n}-0\right]-\left[\frac{x_M^2+\sum_{i=2}^{n-1}x_i^2}{n-1}-\left(\frac{x_M+\sum_{i=2}^{n-1}x_i}{n-1}\right)^2\right] \\
&= \frac{\sum_{i=1}^{n}x_i^2}{n}-\frac{x_M^2+\sum_{i=2}^{n-1}x_i^2}{n-1}+\frac{(x_M-x_1-x_n)^2}{(n-1)^2} \\
&= \frac{(n-1)\sum_{i=1}^{n}x_i^2-nx_M^2-n\sum_{i=2}^{n-1}x_i^2}{n(n-1)}+\frac{(x_M-x_1-x_n)^2}{(n-1)^2} \\
&= \frac{(n-1)(x_1^2+x_n^2)-nx_M^2-\sum_{i=2}^{n-1}x_i^2}{n(n-1)}+\frac{(x_M-x_1-x_n)^2}{(n-1)^2}
\end{align}$$

It is not obvious to me how this messy expression can be further simplified in order to establish its non-negativity. Would you mind pointing it out? Thank you.
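For what it's worth, here is a quick numerical sanity check of the last expression (a small NumPy sketch; the variable names are purely illustrative). It compares the expression against the variance difference computed directly from random mean-zero data and checks that the difference is never negative.

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(1000):
    n = int(rng.integers(3, 30))
    x = np.sort(rng.normal(size=n))
    x -= x.mean()                      # enforce the assumption mu_1(X, Y) = 0

    x1, xn = x[0], x[-1]               # smallest and largest values
    xM = np.median(x)                  # median of all n values
    middle = x[1:-1]                   # x_2, ..., x_{n-1}

    # Direct computation: population variances, Var = mu_2 - mu_1^2.
    direct = np.var(x) - np.var(np.append(middle, xM))

    # The last displayed expression above.
    formula = ((n - 1) * (x1**2 + xn**2) - n * xM**2 - np.sum(middle**2)) \
        / (n * (n - 1)) + (xM - x1 - xn)**2 / (n - 1)**2

    assert np.isclose(direct, formula)
    assert direct >= -1e-12            # never (meaningfully) negative
```

So the expression is numerically correct; the difficulty is only in simplifying it by hand.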

  • What are some of the common tricks that can be employed for proving claims about the variance (not just for this claim)? Direct expansion does not seem promising to me.... – Jigoku Shoujo Aug 11 '18 at 15:52
  • Could you make this comment part of your question? – The Laconic Aug 11 '18 at 16:23
  • 1
    Re the edit: The whole point of the notation I adopted is to avoid these messy summations, making it easier to find a solution. – whuber Aug 12 '18 at 13:43

1 Answer


Breaking the problem down into conceptually distinct parts makes it soluble. This is one general approach to dealing with variances.


Framing the problem

We may view the data as a collection of numbers (which needn't be distinct) $x_1, x_2, \ldots, x_n$ where $n\ge 2,$ $x_1$ is the smallest, and $x_n$ is the largest. Partition this into two groups: $X=(x_2, x_3, \ldots, x_{n-1})$ of $n-2$ numbers and $Y=(x_1,x_n)$ of $2$ numbers. The problem asks what happens when $Y$ is replaced by $Y^\prime=(x_M)$ where $x_M$ is the median of all $n$ values.


Simplifying the problem with preliminary manipulations and inequalities

Because the variance does not change when all values are shifted, we may assume the midrange of all the values is $0$: that is, $x_1 = -x_n.$

Choose a unit of measure in which $x_n=1$; rescaling multiplies both variances by the same positive constant, so it does not change the sign of their difference. This is always possible unless $x_1=x_n=0,$ in which case all the values are equal and it is obvious that the initial and final variances are both zero.
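This normalization is easy to illustrate numerically (a minimal NumPy sketch; the helper `var_diff` is mine, not part of the argument): shifting leaves both variances unchanged and rescaling divides both by the same positive constant, so the sign of their difference is preserved.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.normal(size=11))

def var_diff(v):
    """Var of all values minus Var after the min and max are swapped for the median."""
    return np.var(v) - np.var(np.append(v[1:-1], np.median(v)))

# Normalize: shift so the midrange is 0, then rescale so the largest value is 1.
z = x - (x[0] + x[-1]) / 2
scale = z[-1]                          # equals (max - min) / 2 > 0
z = z / scale

# Both variances are divided by scale**2, so the difference keeps its sign.
assert np.isclose(var_diff(x), scale**2 * var_diff(z))
```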

Note that since now all values lie between $-1$ and $1$, the median $x_M$ and the mean $\bar x$ of $X$ also lie between $-1$ and $1.$ That is,

$$-1=x_1 \le x_M \le x_n = 1 \qquad\text{and}\qquad -1 \le \bar x \le 1.\tag{*}$$

Also, the variance $\sigma^2$ of $X$ cannot exceed $1,$ because every element of $X$ lies between $-1$ and $1.$ Let $\delta=x_M-\bar x$ be the difference between the median and the mean. Because removing the smallest and the largest value does not change the median, $x_M$ is also the median of $X;$ combining the standard mean–median–standard deviation inequality for $X$ (a simple exercise) with the bound on $\sigma^2$ gives

$$\delta^2 \le \sigma^2 \le 1.\tag{**}$$
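Both inequalities in $(**)$ are easy to check numerically on simulated data (a small NumPy sketch, using the normalization above; a sanity check only, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)

for _ in range(1000):
    n = int(rng.integers(3, 40))
    x = np.sort(rng.normal(size=n))
    x = x - (x[0] + x[-1]) / 2         # midrange 0
    x = x / x[-1]                      # largest value 1

    X = x[1:-1]                        # drop the smallest and largest values
    delta = np.median(x) - X.mean()    # median of all n values minus mean of X
    sigma2 = np.var(X)                 # population variance of X

    assert delta**2 <= sigma2 + 1e-12 <= 1 + 1e-12
```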

Familiar Definitions

Moments

The (raw) moment of degree $k$ of a collection of numbers is the arithmetic mean of their $k^\text{th}$ powers. For convenience, when $Z$ denotes a collection of numbers $(z_1, z_2, \ldots, z_m),$ let's write

$$\mu_k(Z) = \frac{1}{m}\sum_{i=1}^m z_i^k$$

for their $k^\text{th}$ moment.

Behavior of moments under partitioning

When a collection of numbers $Z$ is partitioned into $Z = (X,Y)$ with $X=(x_1,\ldots,x_n)$ and $Y=(y_1,\ldots,y_m),$ the foregoing formula breaks into two separate sums, yielding

$$\mu_k(Z) = \frac{1}{n+m}\left(n\,\mu_k(X) + m\,\mu_k(Y)\right).$$

Variance

One formula for the variance of a set of numbers is their second moment minus the square of their first moment:

$$\operatorname{Var}(Z) = \mu_2(Z) - \mu_1(Z)^2.$$
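Both of these facts are straightforward to confirm numerically (a throwaway NumPy sketch; the helper `mu` just implements the raw-moment definition above):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=7)
Y = rng.normal(size=4)
Z = np.concatenate([X, Y])

def mu(k, v):
    """k-th raw moment: the arithmetic mean of the k-th powers."""
    return np.mean(v**k)

# Moments of a partitioned collection are weighted averages of the groups' moments.
for k in (1, 2, 3):
    pooled = (len(X) * mu(k, X) + len(Y) * mu(k, Y)) / (len(X) + len(Y))
    assert np.isclose(mu(k, Z), pooled)

# Variance equals the second moment minus the square of the first moment.
assert np.isclose(np.var(Z), mu(2, Z) - mu(1, Z)**2)
```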


Solution

Step 1: The formula

We need to find a simple formula for $\Delta_X(Y,Y^\prime)=\operatorname{Var}(X,Y) - \operatorname{Var}(X,Y^\prime)$ and then analyze it: the problem is to show this is never negative.

The preceding formulas algebraically simplify, without any trouble, to

$$\eqalign{ n^2(n-1)^2\Delta_X(Y,Y^\prime) &= n(2 + n(n-2)(2-\delta^2))\\ &+ (n-1)(n-2)\left(2(n-1)(\bar x)^2 - n\sigma^2\right). }$$
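Before analyzing this identity, it can be double-checked numerically (a NumPy sketch under the normalization above; all names are mine): the left side is computed directly from the data, the right side only from $n,$ $\bar x,$ $\sigma^2,$ and $\delta.$

```python
import numpy as np

rng = np.random.default_rng(4)

for _ in range(1000):
    n = int(rng.integers(3, 40))
    x = np.sort(rng.normal(size=n))
    x = (x - (x[0] + x[-1]) / 2) / ((x[-1] - x[0]) / 2)   # midrange 0, largest value 1

    X = x[1:-1]
    xbar, sigma2 = X.mean(), np.var(X)
    xM = np.median(x)
    delta = xM - xbar

    # Left side: the variance difference, computed directly from the data.
    lhs = n**2 * (n - 1)**2 * (np.var(x) - np.var(np.append(X, xM)))

    # Right side: the claimed closed form in n, xbar, sigma2 and delta.
    rhs = n * (2 + n * (n - 2) * (2 - delta**2)) \
        + (n - 1) * (n - 2) * (2 * (n - 1) * xbar**2 - n * sigma2)

    assert np.isclose(lhs, rhs)
    assert lhs >= n**2 - 1e-9          # the lower bound derived in Step 2 below
```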

Step 2: Analysis of the formula

Apply the inequalities in $(*)$ and $(**)$: on the right-hand side the coefficients of $\delta^2$ and $\sigma^2$ are nonpositive while the coefficient of $(\bar x)^2$ is nonnegative, so the expression can only decrease when $\delta^2$ and $\sigma^2$ are replaced by their upper bound $1$ and $(\bar x)^2$ by its lower bound $0.$ This yields

$$\eqalign{&n^2(n-1)^2\Delta_X(Y,Y^\prime)\\& \ge n(2 + n(n-2)(2-1^2)) + (n-1)(n-2)(2(n-1)(0)^2 - n(1)) \\ &=n^2 \gt 0, }$$

QED.
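As a final end-to-end check of the original claim, with no normalization or mean-zero assumption (again just a NumPy sketch on simulated data):

```python
import numpy as np

rng = np.random.default_rng(5)

for _ in range(1000):
    n = int(rng.integers(3, 50))
    x = np.sort(rng.normal(loc=3.0, scale=2.5, size=n))   # arbitrary location and scale
    replaced = np.append(x[1:-1], np.median(x))            # drop min and max, add the median
    assert np.var(x) > np.var(replaced)                    # the variance strictly decreases
```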

whuber
  • Thank you very much for your detailed answer. Please refer to my "answer" just posted. I would have liked to post a comment but a comment that long is not allowed. – Jigoku Shoujo Aug 12 '18 at 10:21
  • 1
    Although the proof is not difficult to follow after your careful explanation, I did not expect that the proof of such an intuitively obvious statement (a statement that can well be understood by high school students) to require this level of mathematical sophistication. Thank you very much for your fabulous proof! – Jigoku Shoujo Aug 15 '18 at 16:03