32

In basic undergraduate statistics courses, students are (usually?) taught hypothesis testing for the mean of a population.
Why is it that the focus is on the mean and not on the median? My guess is that it is easier to test the mean due to the central limit theorem, but I'd love to read some educated explanations.

Richard Hardy
nafrtiti
  • 3
    The mean has useful properties for uniqueness, calculation, and calculus. It is often related to the sufficient statistics. – Henry Oct 08 '17 at 08:07

3 Answers

41

Because Alan Turing was born after Ronald Fisher.

In the old days, before computers, all this stuff had to be done by hand or, at best, with what we would now call calculators. Tests for comparing means can be done this way - it's laborious, but possible. Tests for quantiles (such as the median) would be pretty much impossible to do this way.

For example, quantile regression relies on minimizing a relatively complicated function. This would not be possible by hand; it is possible with programming. See e.g. Koenker or Wikipedia.
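
To make the "relatively complicated function" concrete: in the simplest (intercept-only) case, quantile regression minimizes Koenker's check (pinball) loss, and its minimizer is the sample quantile. Below is a minimal sketch, assuming Python with NumPy; the brute-force grid search is only for illustration, since real software such as R's quantreg solves a linear program instead.

```python
import numpy as np

def check_loss(theta, x, tau):
    """Koenker's check (pinball) loss of a scalar theta for data x at quantile level tau."""
    u = x - theta
    return np.sum(u * (tau - (u < 0)))

rng = np.random.default_rng(0)
x = rng.lognormal(size=1000)                      # a skewed sample

# Brute-force grid search, purely to illustrate what is being minimized.
grid = np.linspace(x.min(), x.max(), 2001)
for tau in (0.25, 0.50, 0.75):
    losses = [check_loss(t, x, tau) for t in grid]
    print(tau, grid[np.argmin(losses)], np.quantile(x, tau))  # minimizer ~ sample quantile
```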

Quantile regression has fewer assumptions than OLS regression and provides more information.

Peter Flom
  • 7
    At that time [computers](https://en.m.wikipedia.org/wiki/Human_computer) did exist, but the word meant something very different from what we mean by it now. – Maarten Buis Oct 08 '17 at 14:22
  • 6
    Indeed! Computers were people who did the calculations. – Peter Flom Oct 08 '17 at 14:23
  • @PeterFlom - but today, students do learn programming, etc. So you're saying the statistics curriculum is what it is for historical reasons? Why isn't the syllabus changed? – nafrtiti Oct 09 '17 at 06:39
  • 2
    @nafrtiti The syllabus is changing, but slowly. There is a lot of momentum to overcome and people outside statistics are not used to the new ideas so may reject them. – Peter Flom Oct 09 '17 at 11:06
  • But isn't calculating quantiles (almost) as laborious as calculating the average? For the former, we just need to sort them and do some counting, whereas for the latter, we have to add all the numbers up. Sorting can be done with [Quicksort](https://en.wikipedia.org/wiki/Quicksort), which on average merely takes $O(n \log n)$ comparisons to sort $n$ items, or with other even more efficient algorithms. Also, say, comparing 3.1415926 and 2.7182818 sounds much easier than adding them together! – nalzok Oct 09 '17 at 14:53
  • 3
    @SunQingyao Sorting is much more expensive than adding. Adding is $O(n)$, it's one of the most basic operations of hardware, and it requires only one register. In addition to that, all I need to know is the total and the number of items to add more data and calculate the new mean. In order to calculate the median, I need the entire set. – JimmyJames Oct 09 '17 at 15:29
  • 2
    It's not about figuring out quantiles vs. means as much as it is the math behind OLS regression vs. quantile regression or the math of the bootstrap and randomization tests and so on. – Peter Flom Oct 09 '17 at 17:58
  • 3
    With quickselect (and using median-of-medians with groups of 5 to select the pivot if randomly chosen pivots turn out badly) you can find a quantile in $O(n)$, making the gap between median and average smaller. Of course you need to know that such methods exist (which was unknown even in Turing's time); see the sketch below these comments. – Surt Oct 09 '17 at 21:21
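
A minimal sketch of the computational point in the comments above, assuming Python with NumPy (the variable names and the simulated stream are my own, not part of the thread): the mean can be maintained from a running total in constant memory, while the median needs the whole sample, although `np.partition` (an introselect/quickselect implementation) finds the required order statistic in expected linear time rather than via a full sort.

```python
import numpy as np

rng = np.random.default_rng(1)
stream = rng.normal(size=100_001)   # odd length, so the median is a single order statistic

# Running mean: constant memory, one addition per new observation.
total, count = 0.0, 0
for value in stream:
    total += value
    count += 1
running_mean = total / count

# Median: needs the whole sample. np.partition places the k-th order statistic
# at position k in expected O(n) time (introselect), without a full O(n log n) sort.
k = len(stream) // 2
median = np.partition(stream, k)[k]

print(running_mean, np.mean(stream))   # agree up to floating-point rounding
print(median, np.median(stream))       # agree exactly
```
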
22

I would like to add a third reason to the correct reasons given by Harrell and Flom. The reason is that we use Euclidean distance (L2) and not Manhattan distance (L1) as our standard measure of closeness or error. If one has data points $x_1, \ldots, x_n$ and wants a single number $\theta$ to summarize them, an obvious notion is to find the number that minimizes the 'error', i.e. the total discrepancy between the chosen number and the numbers that constitute the data. In mathematical notation, for a given error function $E$, one wants to find
$$\min_{\theta \in \Bbb{R}} E(\theta, x_1, \ldots, x_n) = \min_{\theta \in \Bbb{R}} \sum_{i=1}^{n} E(\theta, x_i).$$
If one takes for $E(\theta, x)$ the squared Euclidean (L2) distance, that is $E(\theta, x) = (\theta - x)^2$, then the minimizer over all $\theta \in \Bbb{R}$ is the mean. If one takes the L1 or Manhattan distance, $E(\theta, x) = |\theta - x|$, the minimizer over all $\theta \in \Bbb{R}$ is the median. Thus the mean is the natural mathematical choice - if one is using the L2 distance!
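
A quick numerical illustration of this fact, as a minimal sketch assuming Python with NumPy (the grid search and the exponential sample are my own choices, not part of the answer): minimizing the summed squared error over a grid of candidate $\theta$ lands on the sample mean, while minimizing the summed absolute error lands on the sample median.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(size=501)                      # skewed, so mean and median differ

theta = np.linspace(x.min(), x.max(), 5001)        # candidate summaries
l2 = ((x[:, None] - theta) ** 2).sum(axis=0)       # sum_i (x_i - theta)^2
l1 = np.abs(x[:, None] - theta).sum(axis=0)        # sum_i |x_i - theta|

print(theta[np.argmin(l2)], x.mean())              # L2 minimizer ~ sample mean
print(theta[np.argmin(l1)], np.median(x))          # L1 minimizer ~ sample median
```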

meh
  • 7
    Since $E$ is broadly used to denote *expectation*, I suggest replacing $E$ with, say, $\text{Err}$. – Richard Hardy Oct 10 '17 at 09:23
  • 3
    Perhaps it is worth noting that $x^2$ is differentiable at $x=0$ while $|x|$ is not. In my opinion, this is a subtle but key underlying reason why MSE is more prevalent in the mathematical statistics arena than MAE. – Just_to_Answer Nov 04 '17 at 21:32
  • 1
    @Just_to_Answer - I think that is yet another reason, sort of. I've thought about this a lot over the years. For me, I've concluded that what you say is tied up with why we generally use Euclidean and not Manhattan distance :) – meh Nov 07 '17 at 21:15
19

Often the mean is chosen over the median not because it's more representative, robust, or meaningful but because people confuse estimator with estimand. Put another way, some choose the population mean as the quantity of interest because with a normal distribution the sample mean is more precise than the sample median. Instead they should think more, as you have done, about the true quantity of interest.

One sidebar: we have a nonparametric confidence interval for the population median but there is no nonparametric method (other than perhaps the numerically intensive empirical likelihood method) to get a confidence interval for the population mean. If you want to stay distribution-free you might concentrate on the median.
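
For readers who have not seen that interval, here is a minimal sketch of the standard order-statistic construction, assuming Python with NumPy and SciPy (the function name `median_ci` is my own): for any continuous distribution, the number of observations falling below the median is Binomial(n, 1/2), which gives an exact, distribution-free confidence interval between two order statistics.

```python
import numpy as np
from scipy.stats import binom

def median_ci(x, conf=0.95):
    """Conservative distribution-free confidence interval for the population median.

    For a continuous distribution, the count of observations below the median is
    Binomial(n, 1/2), so the interval [x_(j), x_(k)] between order statistics covers
    the median with probability binom.cdf(k-1, n, 0.5) - binom.cdf(j-1, n, 0.5).
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    alpha = 1.0 - conf
    j = max(int(binom.ppf(alpha / 2, n, 0.5)), 1)           # 1-based lower index
    k = min(int(binom.ppf(1 - alpha / 2, n, 0.5)) + 1, n)   # 1-based upper index
    coverage = binom.cdf(k - 1, n, 0.5) - binom.cdf(j - 1, n, 0.5)
    return x[j - 1], x[k - 1], coverage

rng = np.random.default_rng(3)
print(median_ci(rng.lognormal(size=75)))   # (lower, upper, achieved coverage >= 0.95)
```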

Note that the central limit theorem is far less useful than it seems, as has been discussed elsewhere on this site. It effectively assumes that the variance is known or that the distribution is symmetric and has a shape such that the sample variance is a competitive estimator of dispersion.
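
This can be checked by simulation; below is a minimal sketch, assuming Python with NumPy/SciPy (the lognormal population, sample size, and helper name are my own choices). For skewed populations and modest $n$, the usual t interval for the mean is known to cover at less than its nominal rate.

```python
import numpy as np
from scipy.stats import t

def t_interval_coverage(n=30, reps=20_000, conf=0.95, seed=0):
    """Monte-Carlo coverage of the usual t interval for the mean of a
    standard lognormal population, whose true mean is exp(1/2)."""
    rng = np.random.default_rng(seed)
    true_mean = np.exp(0.5)
    crit = t.ppf(1 - (1 - conf) / 2, df=n - 1)
    hits = 0
    for _ in range(reps):
        x = rng.lognormal(size=n)
        half_width = crit * x.std(ddof=1) / np.sqrt(n)
        hits += abs(x.mean() - true_mean) <= half_width
    return hits / reps

print(t_interval_coverage())   # typically noticeably below the nominal 0.95
```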

Frank Harrell
  • 2
    I believe it's possible to construct a nonparametric confidence interval for the mean - say via a permutation test (this can be done under an assumption of symmetry without assuming any specific functional form, for example). That's a somewhat restricted situation, though it's also possible under some other assumptions than symmetry. If you're prepared to deal with the approximate coverage that comes with bootstrapping, you can get nonparametric intervals without assumptions like symmetry. – Glen_b Oct 08 '17 at 13:36
  • 2
    If it assumes symmetry it is parametric. Haven't seen this extended to non-symmetric cases. The bootstrap (all variants except perhaps the studentized t method) is extremely inaccurate under severe asymmetry. See https://stats.stackexchange.com/questions/186957 – Frank Harrell Oct 08 '17 at 13:57
  • 5
    *Symmetry* is not finite-parametric. A Wilcoxon signed rank test assumes symmetry (in order to have exchangeability of signs) under the null. You'd call that parametric? – Glen_b Oct 09 '17 at 03:41
  • @FrankHarrell - thank you for the informative answer. Do you by any chance you have a link to the discussion on the central limit theorem? – nafrtiti Oct 09 '17 at 06:45
  • 2
    https://stats.stackexchange.com/questions/9573 https://stats.stackexchange.com/questions/186957 – Frank Harrell Oct 09 '17 at 12:05
  • 2
    On @Glen_b's question about symmetry - that's an excellent question. The Wilcoxon signed-rank test is an interesting case because, unlike the Wilcoxon 2-sample test, it makes a heavy symmetry assumption. I guess you could say that you can be non-parametric while still requiring some kind of general assumption such as symmetry. Maybe the terminology should be "nonparametric with restrictions"? On the other hand the nonparametric 2-sample test has restrictions with regard to what optimizes type II error (but not type I error). – Frank Harrell Oct 14 '17 at 13:57
  • 1
    "Nonparametric" is a funny term; we tend to give it additional shades of meaning (I know I often do). Formally, it's simply about counting parameters -- if you can specify a fixed number of parameters and completely specify the model, its parametric. If the parameters are not fixed in number (e.g. could potentially be uncountable), it's not parametric. There are numerous situations where there are some kind of restrictive assumptions (even *continuity* is arguably a pretty strong assumption) but not necessarily parametric ones. Perhaps you'll disagree but I don't see symmetry ... ctd – Glen_b Oct 14 '17 at 22:41
  • 1
    ctd... as very strong in the paired case, since it amounts to assuming no effect under the null (whereupon the differences should be symmetric). For the one-sample case I'd agree it's a fairly strong assumption, though typically relatively weak compared to common parametric assumptions. – Glen_b Oct 14 '17 at 22:51
  • I think that's a good summary of the situation. I've often wondered if we could find better terminology. For some things, distribution-free is a good term. But I usually use semiparametric models to do 'nonparametric' tests and using 'semiparametric' is pretty clear. – Frank Harrell Oct 15 '17 at 00:06