20

This may be a simple question for many but here it is:

Why isn't variance defined as the difference between each value and the one that follows it, instead of the difference from the average of the values?

This would seem the more logical choice to me; I guess I'm obviously overlooking some disadvantages. Thanks

EDIT:

Let me rephrase as clearly as possible. This is what I mean:

  1. Assume you have a range of numbers, ordered: 1,2,3,4,5
  2. Calculate and sum up the absolute differences between successive values (between each value and the next, not all pairwise differences), without using the average (see the sketch at the end of this edit).
  3. Divide by number of differences
  4. (Follow-up: would the answer be different if the numbers were unordered?)

-> What are the disadvantages of this approach compared to the standard formula for variance?
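
For concreteness, here is a minimal sketch of the calculation I mean (plain Python; the variable names are only for illustration):

```python
values = [1, 2, 3, 4, 5]  # the ordered example from step 1

# Step 2: absolute differences between each value and the next one
diffs = [abs(b - a) for a, b in zip(values, values[1:])]  # -> [1, 1, 1, 1]

# Step 3: divide by the number of differences
proposed_dispersion = sum(diffs) / len(diffs)             # -> 1.0

# For comparison, the usual (population) variance of the same values
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)  # -> 2.0

print(proposed_dispersion, variance)
```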

user2305193
  • 358
  • 2
  • 11
  • 1
    You may also be interested in reading about autocorrelation (e.g. http://stats.stackexchange.com/questions/185521/measuring-dependency-of-subsequent-points-from-markov-chain/185524#185524 ). – Tim Jul 26 '16 at 18:21
  • 2
    @user2305193 whuber's answer is correct, but his formula uses the squared distances within an ordering of the data, averaged over all orderings. Neat trick; however, the process of finding the variance that you have indicated is exactly what I tried to implement in my answer, and demonstrated would not do a good job. Trying to clear the confusion. – Greenparker Jul 26 '16 at 19:17
  • 1
    For fun, look up the Allan Variance. – hobbs Jul 27 '16 at 04:43
  • on another thought, I guess since you don't square differences (and you don't take the square root afterwards) but take the absolute values, this should rather be 'why isn't this how we calculate the standard deviation' instead of 'why isn't this how we calculate variance'. But I'll give it a rest now – user2305193 Jul 27 '16 at 15:24

8 Answers

37

It is defined that way!

Here's the algebra. Let the values be $\mathbf{x}=(x_1, x_2, \ldots, x_n)$. Denote by $F$ the empirical distribution function of these values (which means each $x_i$ contributes a probability mass of $1/n$ at the value $x_i$) and let $X$ and $Y$ be independent random variables with distribution $F$. By virtue of basic properties of variance (namely, it is a quadratic form) as well as the definition of $F$ and the fact $X$ and $Y$ have the same mean,

$$\eqalign{ \operatorname{Var}(\mathbf{x})&=\operatorname{Var}(X) = \frac{1}{2}\left(\operatorname{Var}(X) + \operatorname{Var}(Y)\right)=\frac{1}{2}\left(\operatorname{Var}(X-Y)\right)\\ &=\frac{1}{2}\left(\mathbb{E}((X-Y)^2) - \mathbb{E}(X-Y)^2\right)\\ &=\mathbb{E}\left(\frac{1}{2}(X-Y)^2\right) - 0\\ &=\frac{1}{n^2}\sum_{i,j}\frac{1}{2}(x_i - x_j)^2. }$$

This formula does not depend on the way $\mathbf{x}$ is ordered: it uses all possible pairs of components, comparing them using half their squared differences. It can, however, be related to an average over all possible orderings (the group $\mathfrak{S}(n)$ of all $n!$ permutations of the indices $1,2,\ldots, n$). Namely,

$$\operatorname{Var}(\mathbf{x})=\frac{1}{n^2}\sum_{i,j}\frac{1}{2}(x_i - x_j)^2 = \frac{1}{n!}\sum_{\sigma\in\mathfrak{S}(n)} \frac{1}{n} \sum_{i=1}^{n-1} \frac{1}{2}(x_{\sigma(i)} - x_{\sigma(i+1)})^2.$$

That inner summation takes the reordered values $x_{\sigma(1)}, x_{\sigma(2)}, \ldots, x_{\sigma(n)}$ and sums the (half) squared differences between all $n-1$ successive pairs. The division by $n$ essentially averages these successive squared differences. It computes what is known as the lag-1 semivariance. The outer summation does this for all possible orderings.
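
A quick numerical check of these two identities on the $1, 2, 3, 4, 5$ example from the question (a minimal Python sketch; the function and variable names are only for illustration):

```python
from itertools import permutations
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(x)

# Usual population variance (divisor n)
var_usual = np.var(x)                                                # 2.0

# Half the average squared difference over all ordered pairs (i, j)
var_pairs = sum(0.5 * (xi - xj) ** 2 for xi in x for xj in x) / n**2

# Lag-1 semivariance of one ordering, with divisor n as in the formula above
def lag1_semivariance(seq):
    seq = np.asarray(seq)
    return np.sum(0.5 * np.diff(seq) ** 2) / len(seq)

# Average of the lag-1 semivariance over all n! orderings
var_perm = np.mean([lag1_semivariance(p) for p in permutations(x)])

print(var_usual, var_pairs, var_perm)                                # all equal 2.0
```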


These two equivalent algebraic views of the standard variance formula give new insight into what the variance means. The semivariance is an inverse measure of the serial covariance of a sequence: the covariance is high (and the numbers are positively correlated) when the semivariance is low, and conversely. The variance of an unordered dataset, then, is a kind of average of all possible semivariances obtainable under arbitrary reorderings.

whuber
  • 281,159
  • 54
  • 637
  • 1,101
  • There is a flaw in the 6th equality. You are mistaking the expected value of an estimator for the actual value of a parameter (unless you use lowercase and uppercase notation interchangeably). – Mur1lo Jul 26 '16 at 23:46
  • 1
    @Mur1lo On the contrary: I believe this derivation is correct. Apply the formula to some data and see! – whuber Jul 27 '16 at 01:18
  • 1
    I think Mur1lo may have been talking not about the correctness of the formula for variance but about apparently passing directly from expectations of random variables to functions of sample quantities. – Glen_b Jul 27 '16 at 01:35
  • 1
    @glen But that's precisely what the empirical distribution function lets us do. That's the entire point of this approach. – whuber Jul 27 '16 at 01:42
  • 3
    Yes, that's clear to me; I was trying to point out where the confusion seemed to lie. Sorry to be vague. Hopefully it's clearer now why it only appears* to be a problem. $\:$ *(this is why I used the word "apparent" earlier, to emphasize it was just the out-of-context appearance of that step that was likely to be the cause of the confusion) – Glen_b Jul 27 '16 at 02:00
  • @whuber Yes, Glen_b got right what I was talking about. But I don't get what you mean when you say "that's precisely what the empirical distribution function lets us do". I don't see an "apparent" passing directly from expectations of random variables to functions of sample quantities; I see a clear one :). Nothing wrong with the reasoning btw. – Mur1lo Jul 27 '16 at 02:28
  • 2
    @Mur1lo The only thing I have done in any of these equations is to apply definitions. There is no passing from expectations to "sample quantities". (In particular, no sample of $F$ has been posited or used.) Thus I am unable to identify what the apparent problem is, nor suggest an alternative explanation. If you could expand on your concern then I might be able to respond. – whuber Jul 27 '16 at 11:17
  • Now I see what the problem was. I was wrongly assuming that the questioner was asking about different ways of estimating the variance of random variables, but only realized that he wasn't after reading your comment on my answer. My apologies. – Mur1lo Jul 28 '16 at 01:23
  • Nevertheless I would like to say in my defense that the word [variance](https://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance) is most commonly used as a synonym for **population variance**. – Mur1lo Jul 28 '16 at 01:32
  • 1
    @Mur1lo I feel obliged to point out that the population variance is the *only* sense employed in my answer. – whuber Jul 28 '16 at 04:09
27

The most obvious reason is that there is often no time sequence in the values. So if you jumble the data, it makes no difference in the information conveyed by the data. If we follow your method, then every time you jumble the data you get a different sample variance.

The more theoretical answer is that sample variance estimates the true variance of a random variable. The true variance of a random variable $X$ is $$E\left[ (X - EX)^2 \right]. $$

Here $E$ represents expectation or "average value". So the definition of the variance is the average squared distance of the variable from its average value. When you look at this definition, there is no "time order" here since there is no data. It is just an attribute of the random variable.

When you collect iid data from this distribution, you have realizations $x_1, x_2, \dots, x_n$. The best way to estimate the expectation is to take the sample average. The key here is that we got iid data, and thus there is no ordering to the data. The sample $x_1, x_2, \dots, x_n$ is the same as the sample $x_2, x_5, x_1, \dots, x_n$.

EDIT

Sample variance measures a specific kind of dispersion for the sample: the average distance from the mean. There are other measures of dispersion, like the range of the data and the interquartile range.

Even if you sort your values in ascending order, that does not change the characteristics of the sample. The sample (data) you get are realizations from a variable. Calculating the sample variance is akin to understanding how much dispersion is in the variable. So for example, if you sample 20 people, and calculate their height, then those are 20 "realizations" from the random variable $X = $ height of people. Now the sample variance is supposed to measure the variability in the height of individuals in general. If you order the data $$ 100, 110, 123, 124, \dots,$$

that does not change the information in the sample.

Let's look at one more example. Say you have 100 observations from a random variable, ordered this way: $$1, 2, 3, 4, 5, 6, 7, 8, 9, 10, \dots, 100.$$ Each successive difference is 1 unit, so by your method the variance will be about 1 (exactly 1 if you divide the sum of the 99 differences by 99, or 0.99 if you divide by 100).

The way to interpret "variance" or "dispersion" is to understand what range of values is likely for the data. In this case you would report a value of about 0.99 units, which of course does not represent the variation well.

If instead of taking the average you just sum the successive differences, then your variance will be 99. Of course that does not represent the variability in the sample either, because 99 gives you the range of the data, not a sense of variability.
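
A short numerical illustration of this point (a Python sketch, with an arbitrary seed and helper name): the usual variance is unchanged when the data are jumbled, while the successive-difference measure changes with the ordering and stays near the step size when the data are sorted.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 101, dtype=float)      # the ordered 1..100 example

def successive_diff_measure(values):
    """Mean absolute difference between consecutive values (the proposal)."""
    return np.mean(np.abs(np.diff(values)))

shuffled = rng.permutation(x)

print(np.var(x), np.var(shuffled))        # both 833.25 (divisor n)
print(successive_diff_measure(x))         # 1.0 for the sorted data
print(successive_diff_measure(shuffled))  # much larger once the order is jumbled
```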

Greenparker
  • 14,131
  • 3
  • 36
  • 80
  • 1
    With the last paragraph you reached out to me, haha, thanks for this flabbergasting answer, I wish I had enough rep to upvote it, please people, do it for me ;-) ACCEPTED!!! – user2305193 Jul 26 '16 at 17:40
  • Follow-up-to-follow-up: What I really meant (yeah, sorry, I only realized the right question after reading your answer) was you sum up the differences and divide by the number of samples. In your last example that would be 99/100 - can you elaborate on that for complete flabbergasted-ness? – user2305193 Jul 26 '16 at 17:50
  • @user2305193 Right, I said 1 unit on average, which is incorrect. It should have been .99 units. Changed it. – Greenparker Jul 26 '16 at 17:52
  • For more info on the 1-100 series: the variance in 1-100 would be 841.7 and the standard deviation 29.01 [source](https://www.wolframalpha.com/input/?i=%5B1...100%5D). So indeed quite a different result. – user2305193 Jul 28 '16 at 11:50
15

Just a complement to the other answers: the variance can be computed from the squared differences between all pairs of terms:

$$\begin{align} \text{Var}(X) &= \frac{1}{2\cdot n^2}\sum_{i=1}^n\sum_{j=1}^n \left(x_i-x_j\right)^2 \\ &= \frac{1}{2\cdot n^2}\sum_{i=1}^n\sum_{j=1}^n \left(x_i - \overline x -x_j + \overline x\right)^2 \\ &= \frac{1}{2\cdot n^2}\sum_{i=1}^n\sum_{j=1}^n \left((x_i - \overline x) -(x_j - \overline x)\right)^2 \\ &= \frac{1}{n}\sum_{i=1}^n \left(x_i - \overline x \right)^2 \end{align}$$

I think this is the closest to the OP proposition. Remember the variance is a measure of dispersion of every observation at once, not only between "neighboring" numbers in the set.


UPDATE

Using your example: $X = \{1, 2, 3, 4, 5\}$. We know the variance is $Var(X) = 2$.

With your proposed method $Var(X) = 1$, so we know beforehand that taking the differences between neighbors as the variance doesn't add up. What I meant was taking every possible difference, squaring each, and then summing:

$$\begin{align} Var(X) &= \frac{(5-1)^2+(5-2)^2+(5-3)^2+(5-4)^2+(5-5)^2+(4-1)^2+(4-2)^2+(4-3)^2+(4-4)^2+(4-5)^2+(3-1)^2+(3-2)^2+(3-3)^2+(3-4)^2+(3-5)^2+(2-1)^2+(2-2)^2+(2-3)^2+(2-4)^2+(2-5)^2+(1-1)^2+(1-2)^2+(1-3)^2+(1-4)^2+(1-5)^2}{2 \cdot 5^2} \\ &=\frac{16+9+4+1+9+4+1+1+4+1+1+4+1+1+4+9+1+4+9+16}{50} \\ &= 2 \end{align}$$
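
The same computation as a compact check (a Python sketch; the names are only for illustration):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
n = len(x)

# Sum of every squared pairwise difference, divided by 2 * n^2
pairwise = sum((xi - xj) ** 2 for xi in x for xj in x) / (2 * n**2)

print(pairwise, np.var(x))  # 2.0  2.0
```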

Firebug
  • 15,262
  • 5
  • 60
  • 127
  • Now I'm seriously confused guys – user2305193 Jul 26 '16 at 18:28
  • @user2305193 In your question, did you mean every pairwise difference or did you mean the difference between a value and the next in a sequence? Could you please clarify? – Firebug Jul 26 '16 at 18:31
  • the continuous, sorted values, so assuming a range [1...5]: 1. sum up the differences: 1+1+1+1 = 4; 2. divide by the number of samples (or differences): 4/5 (or 4/4) – user2305193 Jul 26 '16 at 18:59
  • I understand your answer now, thanks for the update – user2305193 Jul 26 '16 at 20:54
  • Why is everyone mistaking **sample variance** for **population variance**? The first equality is where the confusion is. – Mur1lo Jul 26 '16 at 22:43
  • 2
    @Mur1lo no one is though, I have no idea what you're referring to. – Firebug Jul 26 '16 at 22:49
  • "Variance can be computed as the squared difference between terms" No it can't. By your computation you can **estimate** the variance. Just think about a a Cauchy r.v. and what would result from your calculation. – Mur1lo Jul 26 '16 at 22:53
  • 2
    @Mur1lo This is a general question, and I answered it generally. Variance is a computable parameter, which can be estimated from samples. This question isn't about estimation though. Also we are talking about discrete sets, not about continuous distributions. – Firebug Jul 26 '16 at 22:57
  • 1
    You showed how to estimate the variance by its U-statistic, and that's fine. The problem is when you write: Var("upper case" X) = things involving "lower case" x, you are mixing the two different notions of parameter and estimator. – Mur1lo Jul 26 '16 at 23:02
  • $X$ is the set of $x_i$ in my example; you're conflating it with population statistics due to a previous (textbook-induced) bias. – Firebug Jan 30 '20 at 16:19
6

Others have answered about the usefulness of the variance as usually defined. In any case, we simply have two legitimate definitions of different things: the usual definition of variance, and your definition.

Then, the main question is why the first one is called variance and not yours. That is just a matter of convention. Until 1918 you could have invented anything you wanted and called it "variance", but in 1918 Fisher used that name for what is still called variance, and if you want to define anything else you will need to find another name for it.

The other question is whether the thing you defined might be useful for anything. Others have pointed out its problems as a measure of dispersion, but it's up to you to find applications for it. Maybe you will find such useful applications that in a century your measure is more famous than the variance.

Pere
  • 5,875
  • 1
  • 13
  • 29
  • I know every definition is up to the people deciding on it; I really was looking for help with the up- and downsides of each approach. Usually there's a good reason for people converging on a definition, and as I suspected, I didn't see it straight away. – user2305193 Jul 26 '16 at 17:44
  • 1
    Fisher introduced variance as a term in 1918 but the idea is older. – Nick Cox Jul 26 '16 at 18:09
  • As far as I know, Fisher was the first one to use the name "variance" for variance. That's why I say that before 1918 you could have used "variance" to name anything else you had invented. – Pere Jul 26 '16 at 20:30
3

@GreenParker's answer is more complete, but an intuitive example might be useful to illustrate the drawback of your approach.

In your question, you seem to assume that the order in which realisations of a random variable appear matters. However, it is easy to think of examples in which it doesn't.

Consider the example of the height of individuals in a population. The order in which individuals are measured is irrelevant to both the mean height in the population and the variance (how spread out those values are around the mean).

Your method would seem odd applied to such a case.

Antoine Vernet
  • 1,334
  • 16
  • 24
2

Although there are many good answers to this question, I believe some important points were left behind, and since this question raises a really interesting point, I would like to provide yet another point of view.

Why isn't variance defined as the difference between every value following    
each other instead of the difference to the average of the values?

The first thing to keep in mind is that the variance is a particular kind of parameter, not a certain type of calculation. There is a rigorous mathematical definition of what a parameter is, but for the time being we can think of parameters as mathematical operations on the distribution of a random variable. For example, if $X$ is a random variable with distribution function $F_X$, then its mean $\mu_X$, which is also a parameter, is:

$$\mu_X = \int_{-\infty}^{+\infty}xdF_{X}(x)$$

and the variance of $X$, $\sigma^2_X$, is:

$$\sigma^2_X = \int_{-\infty}^{+\infty}(x - \mu_X)^2dF_{X}(x)$$

The role of estimation in statistics is to provide, from a set of realizations of a r.v., a good approximation for the parameters of interest.

What I wanted to show is that there is a big difference between the concept of a parameter (the variance, for this particular question) and the statistic we use to estimate it.

Why isn't the variance calculated this way?

So we want to estimate the variance of a random variable $X$ from a set of independent realizations of it, let's say $x = \{x_1,\ldots,x_n\}$. The way you propose doing it is by computing the absolute values of successive differences, summing them, and taking the mean:

$$\psi(x) = \frac{1}{n}\sum_{i = 2}^{n}|x_i - x_{i-1}|$$

and the usual statistic is:

$$S^2(x) = \frac{1}{n-1}\sum_{i = 1}^{n}(x_i - \bar{x})^2,$$

where $\bar{x}$ is the sample mean.

When comparing two estimators of a parameter, the usual criterion for the best one is the one with minimal mean squared error (MSE), and an important property of the MSE is that it can be decomposed into two components:

MSE = squared estimator bias + estimator variance.

Using this criterion, the usual statistic, $S^2$, has some advantages over the one you suggest.

  • First, it is an unbiased estimator of the variance, but your statistic is not.

  • Another important point is that, if we are working with the normal distribution, then $S^2$ is the best unbiased estimator of $\sigma^2$, in the sense that it has the smallest variance among all unbiased estimators and thus minimizes the MSE.

When normality is assumed, as is the case in many applications, $S^2$ is the natural choice when you want to estimate the variance.
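
To make the comparison concrete, here is a small simulation sketch (Python; the sample size, true variance, and seed are arbitrary choices for illustration): on normal data, $S^2$ averages very close to the true $\sigma^2$, while $\psi$ does not even target $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2 = 4.0                # true variance
n, reps = 50, 20000

s2_values, psi_values = [], []
for _ in range(reps):
    x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=n)
    s2_values.append(np.var(x, ddof=1))                # S^2, divisor n-1
    psi_values.append(np.sum(np.abs(np.diff(x))) / n)  # psi, as defined above

# S^2 averages about 4.0; psi averages about (n-1)/n * 2*sigma/sqrt(pi) ~ 2.21
print(np.mean(s2_values), np.mean(psi_values))
```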

Mur1lo
  • 1,225
  • 7
  • 15
  • 4
    Everything in this answer is well explained, correct, and interesting. However, introducing the "usual statistic" as an *estimator* confuses the issue, because the question is not about estimation, nor about bias, nor about the distinction between $1/n$ and $1/(n-1)$. That confusion might be at the root of your comments to several other answers in this thread. – whuber Jul 27 '16 at 11:23
2

The time-stepped difference is indeed used in one form, the Allan Variance. http://www.allanstime.com/AllanVariance/

1

Lots of good answers here, but I'll add a few.

  1. The way it is defined now has proven useful. For example, normal distributions appear all the time in data and a normal distribution is defined by its mean and variance. Edit: as @whuber pointed out in a comment, there are various other ways to specify a normal distribution. But none of them, as far as I'm aware, deal with pairs of points in sequence.
  2. Variance as normally defined gives you a measure of how spread out the data is. For example, let's say you have a lot of data points with a mean of zero, but when you look at them, you see that the data are mostly either around -1 or around 1. Your variance would be about 1. However, under your measure, you would get a value close to zero (see the sketch after this list). Which one is more useful? Well, it depends, but it's not clear to me that a "variance" of nearly zero would make sense there.
  3. It lets you do other stuff. Just an example, in my stats class we saw a video about comparing pitchers (in baseball) over time. As I remember it, pitchers appeared to be getting worse since the proportion of pitches that were hit (or were home-runs) was going up. One reason is that batters were getting better. This made it hard to compare pitchers over time. However, they could use the z-score of the pitchers to compare them over time.
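
A hypothetical illustration of point 2 (a Python sketch; the cluster spread of 0.05 and the sample size are arbitrary choices): data clustered around -1 and +1 has variance close to 1, yet the sorted-successive-difference measure from the question comes out near zero.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1, 0.05, 500), rng.normal(1, 0.05, 500)])

print(np.var(x))                           # about 1.0

x_sorted = np.sort(x)
print(np.mean(np.abs(np.diff(x_sorted))))  # about 0.002 -- nearly zero
```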

Nonetheless, as @Pere said, your metric might prove itself very useful in the future.

roundsquare
  • 700
  • 3
  • 13
  • 1
    A normal distribution can also be determined by its mean and fourth central moment, for that matter -- or by means of many other pairs of moments. The variance is not special in that way. – whuber Jul 27 '16 at 01:29
  • @whuber interesting. I'll admit I didn't realize that. Nonetheless, unless I'm mistaken, all the moments are "variance like" in that they are based on distances from a certain point as opposed to dealing with pairs of points in sequence. But I'll edit my answers to make note of what you said. – roundsquare Jul 27 '16 at 12:09
  • 1
    Could you explain the sense in which you mean "deal with pairs of points in sequence"? That's not a part of any standard definition of a moment. Note, too, that all the absolute moments around the mean--which includes all even moments around the mean--give a "measure of how spread out the data" are. One could, therefore, construct an analog of the Z-score with them. Thus, none of your three points appears to differentiate the variance from any absolute central moment. – whuber Jul 27 '16 at 12:22
  • @whuber yeah. The original question posited a 4 step sequence where you sort the points, take the differences between each point and the next point, and then average these. That's what I referred to as "deal[ing] with pairs of points in sequence". So you are right, none of the three points I gave distinguishes variance from any absolute central moment - they are meant to distinguish variance (and, I suppose, all absolute central moments) from the procedure described in the original question. – roundsquare Jul 27 '16 at 13:10