6

I am trying to wrap my mind around the variance definition.

Given a set of values S and n = #(S), the variance is defined as:

$$ \operatorname{var}(S) = \frac{\sum_{i=1}^n \big( S_i - \operatorname{mean}(S) \big)^2}{n} $$

And the square root of that (the standard deviation) measures how far, on average, the values are from the mean.

However, there is a simpler formula that also measures how far away the values are from the mean:

$$ \operatorname{anotherPossibleDefForVar}(S) = \frac{\sum_{i=1}^n |S_i - \operatorname{mean}(S)|}{n} $$

I am trying to understand why we square the deviations instead of using the simpler modulus function. Is there a real reason why variance was defined the first way and not the second way?
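To make the comparison concrete, here is a small Python sketch of both definitions (the dataset is made up for illustration):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Average squared deviation from the mean (the first definition).
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def mean_abs_dev(xs):
    # Average absolute deviation from the mean (the second definition).
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean(data) == 5
print(variance(data))             # 4.0
print(math.sqrt(variance(data)))  # 2.0  (standard deviation)
print(mean_abs_dev(data))         # 1.5
```

Both the standard deviation and the mean absolute deviation are in the same units as the data, yet they disagree (2.0 vs 1.5 here), because squaring weights large deviations more heavily than the modulus does.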

# EDIT #

Ok, looks like the given reasons so far are way more advanced than what I was expecting.

The argument for squaring rather than taking the modulus (namely, that the modulus makes the math more complicated) is valid, but IMHO it is more a consequence of the definition than a reason for it being defined that way. The same goes for the Central Limit Theorem.

I ended up finding the exact same question at Khan Academy. There, the following reasons were also given:

  1. "Squaring emphasizes larger differences (think of the effect outliers have)." Another comment also points out: "In addition to amplifying large differences from the mean, squaring also MINIMIZES tiny differences from the mean".

These are the most convincing reasons I have found so far. The modulus will neither emphasize large deviations nor minimize small ones. HOWEVER, the same argument applies to any even power. A power of 4 will also amplify large differences and minimize tiny ones (in fact it does an even better job of both). So why not take the fourth power, or any other even power for that matter?

  2. "(...) you can also view the equation as being the Euclidean distance between all the points and the mean of the points"

That's more of a "nice-to-have" than a reason to me. If anything, the modulus would give the Manhattan distance. So what?

Having said all that, I am not 100% convinced yet. I believe this question is way deeper than it looks at first glance and judging from the Khan Academy number of upvotes, I am not the only one confused about it.
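To see the "emphasize large, minimize small" effect numerically, here is a little Python sketch I put together (the data are made up), comparing the modulus, the square, and the fourth power as deviation measures, each rescaled back to the original units:

```python
def power_dev(xs, p):
    """Mean p-th-power absolute deviation from the mean, rescaled back
    to the original units by taking the p-th root (p=2 gives the
    standard deviation, p=1 the mean absolute deviation)."""
    m = sum(xs) / len(xs)
    return (sum(abs(x - m) ** p for x in xs) / len(xs)) ** (1 / p)

data = [0, 0, 0, 0, 10]  # one large outlier
for p in (1, 2, 4):
    print(p, power_dev(data, p))
```

The reported dispersion grows with p: the higher the even power, the more the outlier dominates, which is exactly the "why not a power of 4?" question above.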

  • 1
    No, it is the standard deviation which is the square root of the variance that is the unit of distance from the mean. What you are suggesting is called the mean absolute deviation. The standard deviation is used because the variance is a natural parameter for the normal distribution – Michael R. Chernick Apr 28 '17 at 00:01
  • @MichaelChernick Could you clarify what you mean by "variance is a natural parameter for the normal distribution"? – Bitcoin Cash - ADA enthusiast Apr 28 '17 at 00:12
  • The normal distribution is a family with two parameters (the mean and the variance). Both appear in the formula of the general univariate normal density. – Michael R. Chernick Apr 28 '17 at 00:21
  • 1
    Another (somewhat circular?) justification relates to [this recent question](https://stats.stackexchange.com/questions/275034/is-there-a-more-intuitive-statistic-than-the-standard-error-of-the-regression-es): If you want to approximate the data by a constant value, and you choose the mean as that value, you are minimizing the variance. If you choose the median as the constant, you are minimizing the [mean absolute deviation](https://en.wikipedia.org/wiki/Average_absolute_deviation#Mean_absolute_deviation_around_the_median) about the constant. – GeoMatt22 Apr 28 '17 at 00:27
  • @MichaelChernick and because of the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), the normal distribution comes up quite frequently in practice. (And historically was among the first distributions studied rigorously, e.g. [this](https://en.wikipedia.org/wiki/De_Moivre%E2%80%93Laplace_theorem).) – GeoMatt22 Apr 28 '17 at 00:35
  • 1
    @GeoMatt22 We could also talk about Chebyshev's inequality which involves the variance and applies to all distributions with finite second moments. – Michael R. Chernick Apr 28 '17 at 00:38
  • 1
    Why do you think the modulus function is easier than the square? This is not a trivial question, btw – Aksakal Apr 28 '17 at 01:37
  • @Aksakal I was thinking in terms of computational resources. Computing a multiplication is certainly harder than simply flipping the sign of a variable. – Bitcoin Cash - ADA enthusiast Apr 28 '17 at 01:40
  • 1
    Your alternative is more difficult in the sense of not being differentiable at the origin, whereas variance is infinitely differentiable everywhere. – Mark L. Stone Apr 28 '17 at 04:06
  • 1
    Whether the absolute value is computationally "simpler" than squaring is debatable, too. In some models of computing the absolute value is *harder* than squaring because evaluating it requires a branch. Another way to appreciate the complexity of the modulus function is to consider what happens when you extend it to the Complex numbers: now you are compelled to compute a complex conjugate, form a product, *and then take a square root.* Not only is this clearly more work than squaring, *it isn't even analytic anywhere* (never mind at the origin). – whuber Apr 28 '17 at 14:28
  • True, an absolute value could be simpler to calculate on a computer, but this can't be the only criterion for a measure of dispersion. Otherwise, the range is even simpler to calculate. The sum of squares is much simpler to manipulate in equations, it's smooth. That's why my question was what is simpler in your mind? This will ultimately answer your question. When you consider a range of uses of the dispersion it's not obvious whether the absolute value simpler or not, often not. – Aksakal Apr 29 '17 at 13:31

2 Answers

5

Let $\mu=\operatorname{E}(X).$

The main reason for using $\sqrt{\operatorname{var}(X)} = \sqrt{\operatorname{E}((X-\mu)^2)}$ as a measure of dispersion, rather than using the mean absolute deviation $\operatorname{E}(|X-\mu|),$ is that if $X_1,\ldots,X_n$ are independent, then $$ \operatorname{var}(X_1+\cdots+X_n) = \operatorname{var}(X_1)+\cdots+\operatorname{var}(X_n). \tag 1 $$ Nothing like that works with the mean absolute deviation. For example, try it with $X_1,X_2,X_3 \sim \operatorname{i.i.d.} \operatorname{Bernoulli}(1/2).$

In any problem where you use the central limit theorem, you need this.

For example: What is the standard deviation of the number of heads that appear when a coin is tossed $900$ times? That's easy to find because of $(1).$
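Here is a quick numerical check of $(1)$, and of its failure for the mean absolute deviation, using the suggested Bernoulli example; this Python sketch (variable names are my own) enumerates the $2^3$ equally likely outcomes exactly:

```python
from itertools import product
from math import sqrt

# X1, X2, X3 i.i.d. Bernoulli(1/2): enumerate all 2**3 equally likely outcomes.
outcomes = list(product([0, 1], repeat=3))
sums = [sum(o) for o in outcomes]   # distribution of X1 + X2 + X3
singles = [o[0] for o in outcomes]  # marginal distribution of X1

def var(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

def mad(vals):
    m = sum(vals) / len(vals)
    return sum(abs(v - m) for v in vals) / len(vals)

print(var(sums), 3 * var(singles))  # 0.75 0.75 -> variances add
print(mad(sums), 3 * mad(singles))  # 0.75 1.5  -> MADs do not

# The coin-toss example: 900 tosses, each with variance 1/4, so the
# standard deviation of the number of heads is sqrt(900 * 1/4).
print(sqrt(900 * 0.25))  # 15.0
```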

Michael Hardy
  • The central limit theorem does not depend on the definition of variance. It is a statement about weak convergence of sums of random variables. Even without defining variance, a sum of random variables meeting certain conditions still converges to a normal distribution. – user158565 Apr 28 '17 at 02:49
  • Agree with your general explanation (+1) but note that actually independence isn't even necessary as this is the more general formula: $\operatorname{var}\left(\sum_i X_i\right)=\sum_i\sum_j\operatorname{cov}\left(X_i,X_j\right)$. When all covariances between variables are 0, this reduces to your equation (1). And there are other simple rules to figure out how variances change under e.g. scaling operations (see https://en.wikipedia.org/wiki/Variance and https://en.wikipedia.org/wiki/Propagation_of_uncertainty for more info). – Ruben van Bergen Apr 28 '17 at 12:56
  • @a_statistician : How do you know WHICH normal distribution the sum of i.i.d. random variables approximates? The answer is that you know the variance of the sum because it's the sum of the variances. – Michael Hardy Apr 28 '17 at 22:19
  • The mean absolute deviation (E[|X−μ|]) plus the mean specify a unique normal distribution, same as variance plus mean. Parameters of a distribution can be converted or transformed provided the mapping is one-to-one. To study the convergence process, maybe the best tool is the characteristic function. – user158565 Apr 29 '17 at 02:17
  • (+1) I agree this is a useful reason! – GeoMatt22 Apr 29 '17 at 02:43
  • @a_statistician : That is true, but the mean absolute deviation of a sum of independent random variables is not determined by the mean absolute deviations of the separate random variables. – Michael Hardy Apr 29 '17 at 03:27
  • People generally take route A to travel from City X to City Y, because we already built route A and it is very good. But that does not mean I cannot find another way; maybe this new way is very bad, but it exists. – user158565 Apr 29 '17 at 04:45
1

There are already several good answers here, including in the comments. However, as the OP requested a "simpler" justification, here I will expand on my comment.

To me this is a very natural distinction between root-mean-square vs. mean-absolute deviations, and why we might prefer one vs. the other when measuring dispersion. (I do not know if it is "simpler"?)


Say you have some data $x_1,\ldots,x_n$, which you want to approximate by a constant $c$, i.e. $$x_i\approx c$$ for all $i$.

How do you choose the constant? A common approach is to minimize some error $E[c]$.

One choice for $E$ is the sum of squared errors $$E_\text{SSE}=\sum_i\big(x_i-c\big)^2$$ for which the solution is $c_\min=\frac{1}{n}\sum_i x_i$. In other words, we have $$\big[c_\min,E_\min\big]_\text{SSE}=\big[\text{mean}(\mathbf{x}),n\,\text{var}(\mathbf{x})\big]$$ so if you are using the mean as your measure of central tendency, the RMS error is really the "natural" measure of dispersion.

On the other hand, if we choose $E$ to be the sum of absolute errors $$E_\text{SAE}=\sum_i\big|x_i-c\big|$$ the solution is $(c_\min)_\text{SAE}=\text{median}(\mathbf{x})$. So if you want to use mean absolute deviation to measure dispersion, really the "natural" measure of central tendency would be the median.


Summary: If you want to use mean absolute deviation, then arguably you should be measuring dispersion around the median. If you are already using the mean, then arguably standard deviation is the appropriate measure of dispersion. Here "arguably" is justified by optimality (minimum dispersion).
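A brute-force sketch in Python illustrates both optimality claims (the dataset and the search grid are arbitrary choices for illustration):

```python
import statistics

data = [1.0, 2.0, 2.0, 3.0, 10.0]

def sse(c):
    # Sum of squared errors when approximating every point by c.
    return sum((x - c) ** 2 for x in data)

def sae(c):
    # Sum of absolute errors when approximating every point by c.
    return sum(abs(x - c) for x in data)

# Scan a fine grid of candidate constants c in [0, 12].
grid = [i / 100 for i in range(0, 1201)]
best_sse = min(grid, key=sse)
best_sae = min(grid, key=sae)

print(best_sse, statistics.mean(data))    # 3.6 3.6 -> SSE minimized at the mean
print(best_sae, statistics.median(data))  # 2.0 2.0 -> SAE minimized at the median
```

Note how the single large value 10.0 drags the SSE minimizer (the mean) toward it, while the SAE minimizer (the median) stays put.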

GeoMatt22
  • This is just a specialization of my answer [here](https://stats.stackexchange.com/questions/275034/is-there-a-more-intuitive-statistic-than-the-standard-error-of-the-regression-es) to the case of "regression with *only* an intercept". – GeoMatt22 Apr 29 '17 at 02:47