Why does the standard deviation sum squares of deviations from the mean instead of absolute deviations?

Question

(This is a question from a friend of mine who is a secondary school mathematics teacher --- I am posting it on his behalf.)

My question is about the standard deviation. I was talking to a colleague about why the standard deviation is the square root of the sum of squares, rather than just the sum of the absolute values. I thought about it a while ago but got stuck at "it's useful for further calculations, just a manufactured measure when variance is the most used", etc.

Could you shed any light on the matter for a couple of inquisitive teachers?

There are many things that distinguish the SD. One is its close relation to the variance, which is the first (useful) cumulant--and cumulants add. Another is its intimate connection with the CLT, as [I explain in another post](https://stats.stackexchange.com/a/3904/919). — whuber, Apr 03 '20 at 14:41

Ben · Answer 1 · 2022-03-03T09:03:51.490

The alternative statistic you are describing is called the mean absolute deviation (MAD). Both statistics can be computed from a data vector, and they are both used as measures of spread. The standard deviation is more commonly used because it has better properties than the mean absolute deviation in most contexts. One desirable property is that the sample variance (the square of the sample standard deviation, computed with the $n-1$ denominator) is an unbiased estimator of the true variance for any sample of independent and identically distributed (IID) data points with finite variance.
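The unbiasedness claim can be checked with a small simulation. The sketch below is illustrative only: the normal distribution, the true standard deviation of 2, the sample size, and the helper name `average_variance_estimates` are all assumptions chosen for the example, not part of the argument above. It averages the $n-1$ and $n$ denominator estimates over many simulated samples.

```python
import random

def average_variance_estimates(n=5, trials=20000, sigma=2.0, seed=0):
    """Average two variance estimates over many simulated IID samples.

    Returns (mean of the n-1 denominator estimates,
             mean of the n denominator estimates).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_unbiased = 0.0
    total_biased = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
        total_unbiased += ss / (n - 1)  # sample variance
        total_biased += ss / n          # divides by n instead
    return total_unbiased / trials, total_biased / trials

unbiased_avg, biased_avg = average_variance_estimates()
print(unbiased_avg, biased_avg)
```

With a true variance of $\sigma^2 = 4$, the first average should land close to 4 and the second close to $4(n-1)/n = 3.2$, up to simulation noise.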

One way to think about this is to look at it geometrically. If you have a set of $n$ data points then you get an associated set of $n$ deviations from the mean. You can consider that as a vector in $n$-dimensional space. The Euclidean norm of that vector is the square root of the sum of squared deviations, so the sample standard deviation is proportional to the vector norm (it is the norm divided by $\sqrt{n-1}$). So one way of looking at the sample standard deviation is that it is a scaled measure of the length of the vector of deviations from the mean.
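As a quick numerical check of this geometric reading, the following sketch computes the Euclidean length of the deviation vector and rescales it by $\sqrt{n-1}$. The data values are made up for illustration; `statistics.stdev` is the standard-library sample standard deviation.

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 5.0, 10.0]  # made-up example values
n = len(data)
mean = sum(data) / n

# Vector of deviations from the mean, one coordinate per data point.
deviations = [x - mean for x in data]

# Euclidean length (2-norm) of the deviation vector.
norm = math.sqrt(sum(d * d for d in deviations))

# Rescaling the length by sqrt(n - 1) gives the sample standard deviation.
sd_from_norm = norm / math.sqrt(n - 1)

print(norm, sd_from_norm, statistics.stdev(data))
```

The last two printed values should agree, since both compute the same quantity.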

Using p-norms around the mean as measures of spread: To get a more unified geometric perspective on different measures of spread, it is useful to note that almost all of them are scaled versions of p-norms of the vector of deviations around a central point (see e.g. here). The MAD is constructed from the p-norm around the mean with $p=1$ and the SD is constructed from the p-norm around the mean with $p=2$. Suppose you have a data vector $\mathbf{x} = (x_1,\ldots,x_n)$ with sample mean $\bar{x}_n$. For any $1 \leqslant p \leqslant \infty$ (with $p = \infty$ understood as the limiting case) we can construct a measure of spread given by:

$$\text{Spread around the mean}_p = k_{p,n} \times \Big( \sum_{i=1}^n |x_i - \bar{x}_n|^p \Big)^{1/p},$$

where $k_{p,n}$ is some scaling factor designed to adjust the measure for the value of $n$ (used to make the measure of spread comparable across data vectors of different lengths). We could potentially use any value of $1 \leqslant p \leqslant \infty$ for this norm, depending on how much we want large deviations to contribute to the spread, relative to small deviations. The larger we set the value of $p$, the more large deviations contribute to the spread relative to small deviations. Some particular instances of this norm statistic for $p = 1, 2, 3, \ldots, \infty$ are:

$$\begin{aligned} \text{Spread around the mean}_1 &= k_{1,n} \times \sum_{i=1}^n |x_i - \bar{x}_n|, \\[6pt] \text{Spread around the mean}_2 &= k_{2,n} \times \sqrt{ \sum_{i=1}^n (x_i - \bar{x}_n)^2}, \\[6pt] \text{Spread around the mean}_3 &= k_{3,n} \times \sqrt[3]{ \sum_{i=1}^n |x_i - \bar{x}_n|^3}, \\[6pt] &\ \ \vdots \\[12pt] \text{Spread around the mean}_\infty &= k_{\infty,n} \times \max_i |x_i - \bar{x}_n|. \\[6pt] \end{aligned}$$
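The family above can be sketched as a single function. This is a minimal illustration rather than a standard library routine: the function name `spread_around_mean` and the default scaling `k=1.0` are assumptions, and the scalings $1/n$ and $1/\sqrt{n-1}$ used in the example are the conventional ones for the MAD and the sample SD.

```python
import math

def spread_around_mean(data, p, k=1.0):
    """Scaled p-norm of the deviations from the sample mean.

    `k` plays the role of the scaling factor k_{p,n}; pass p=math.inf
    for the limiting case, which is the largest absolute deviation.
    """
    mean = sum(data) / len(data)
    abs_dev = [abs(x - mean) for x in data]
    if math.isinf(p):
        return k * max(abs_dev)
    return k * sum(d ** p for d in abs_dev) ** (1.0 / p)

data = [2.0, 4.0, 4.0, 5.0, 10.0]  # made-up example values
n = len(data)

mad = spread_around_mean(data, 1, k=1.0 / n)                # mean absolute deviation
sd = spread_around_mean(data, 2, k=1.0 / math.sqrt(n - 1))  # sample standard deviation
max_dev = spread_around_mean(data, math.inf)                # largest absolute deviation

print(mad, sd, max_dev)
```

Choosing `k` separately from `p` keeps the norm itself and the sample-size adjustment as two distinct decisions, which mirrors the role of $k_{p,n}$ in the formula.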

For $p=1$ the deviations from the mean are weighted linearly, so a deviation that is twice as large contributes twice as much to the sum. This leads to the MAD as a measure of spread. For $p = 2$ the deviations are weighted quadratically, so a deviation that is twice as large contributes four times as much to the sum inside the root. This leads to the SD as a measure of spread. For $p=\infty$ the largest deviation has all the weight, and entirely determines the spread. This leads to the maximum absolute deviation from the mean, which is closely related to the range (it always lies between half the range and the full range).
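To see the weighting effect numerically, the sketch below computes what fraction of the sum $\sum_i |x_i - \bar{x}_n|^p$ comes from the single largest deviation. The helper name `largest_deviation_share` and the data values are illustrative assumptions.

```python
def largest_deviation_share(data, p):
    """Fraction of sum(|x_i - mean|**p) contributed by the largest deviation."""
    mean = sum(data) / len(data)
    powered = [abs(x - mean) ** p for x in data]
    return max(powered) / sum(powered)

data = [2.0, 4.0, 4.0, 5.0, 10.0]  # made-up values; 10.0 lies far from the mean

# The share grows with p, approaching 1 as p becomes large.
shares = {p: largest_deviation_share(data, p) for p in (1, 2, 3, 10)}
for p, share in shares.items():
    print(p, round(share, 3))
```

For this data the share rises steadily with $p$, which is the sense in which larger $p$ lets large deviations dominate the measure.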

All of these measures of spread have different properties, and their usefulness depends on their properties. (Note that all measures of spread constructed from p-norms have some baseline properties that make them useful. In particular, they obey the properties of norms: they give zero spread only for a data vector with identical values, they are "absolutely scalable", and they obey the triangle inequality with respect to the vector of deviations from the sample mean.) It turns out that the underlying moments of a probability distribution, including its variance, are quite important properties, and so the sample standard deviation also becomes quite important, because the sample variance has a number of useful estimation properties for the true variance. That is the main reason that it is the most widely used measure of spread.
