Can the mean lie outside of the interquartile range? I realize that extreme outliers can affect or pull the mean, but can it pull the mean outside of the interval from the first quartile to the third quartile?
-
Briefly consider this dataset: $(1,2,3,4,10^6)$. Its implications for arbitrary distributions will be clear, I hope. – whuber Oct 15 '14 at 18:55
-
Absolutely. Suppose values are 1,1,1,1,1,1,64. Then the range between the quartiles is from 1 to 1 but the mean is 70/7 = 10. (Pedantic point: the interquartile range is the difference between the quartiles, not the interval from the lower quartile to the upper quartile, but we all know what you mean.) – Nick Cox Oct 15 '14 at 18:58
-
The mean can be arbitrarily far from the interior of the region between the quartiles ... or indeed 10th and 90th percentiles, or 1st and 99th percentiles, ... Take some data (at least 5 points for quartiles, at least 11 for 10th to 90th percentiles and so on), and then fix all but the most extreme observation. As you move it further away from the rest of the data, the quartiles (and other non-extreme percentiles) stay put, but the mean moves 1/n as far as you move that data point. So if you want to shift the mean up by $10^6$, move that point $n\cdot 10^6$. – Glen_b Oct 15 '14 at 21:44
-
-1 This feels like homework. – Unknown Coder Oct 15 '14 at 22:16
1 Answer
If "mean" refers to a statistic for a batch of data, then consider the dataset $(1,2,3,4,10^6)$ whose quartiles must lie between $1$ and $4$ (depending on how you compute them) but whose mean is $200{,}002$.
If instead it refers to a property of a distribution, then assign a probability of $1/5$ to each of the five numbers in the previous batch to create a (discrete) distribution. The same calculations apply, leading to the same conclusions.
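As a quick check of the arithmetic, here is a minimal sketch using Python's standard `statistics` module. The choice of `method="inclusive"` is an assumption made for illustration, not something the answer specifies; other quartile conventions can return different cut points, especially for a batch this small.

```python
from statistics import mean, quantiles

# The batch from the answer: four small values and one huge one.
data = [1, 2, 3, 4, 10**6]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
# method="inclusive" is an assumed convention for this sketch.
q1, median, q3 = quantiles(data, n=4, method="inclusive")
batch_mean = mean(data)

print("Q1 =", q1, " Q3 =", q3)
print("mean =", batch_mean)
print("mean above Q3?", batch_mean > q3)
```

Under this convention the quartiles are data values from the small end of the batch, while the mean is dragged far above them by the single large value.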
The point is that quartiles are resistant to changes in the data, whereas the mean is sensitive to a change in even a single data value. When we add $\epsilon$ to any single value in a dataset of $n\gt 4$ numbers, the mean changes by $\epsilon/n$, which may be arbitrarily large, but the quartiles (if they change at all) shift only to neighboring values in the original dataset and are therefore limited in how much they can change. The preceding example exploited this in an extreme way.
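The $\epsilon/n$ relationship can be illustrated with a small sketch. The nine-point batch, the helper name `summarize`, and the `method="inclusive"` quartile convention are all hypothetical choices made for illustration; only the largest value is moved.

```python
from statistics import mean, quantiles

def summarize(values):
    """Return (mean, Q1, Q3); the 'inclusive' convention is an assumption."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return mean(values), q1, q3

# Hypothetical batch of n = 9 values.
base = [1, 2, 3, 4, 5, 6, 7, 8, 9]
n = len(base)

# To shift the mean by 10**6, move one point by n * 10**6.
target_shift = 10**6
eps = n * target_shift

# Add eps to the largest observation only; everything else stays fixed.
perturbed = sorted(base)
perturbed[-1] += eps

mean_before, q1_before, q3_before = summarize(base)
mean_after, q1_after, q3_after = summarize(perturbed)

print("mean shift:", mean_after - mean_before)    # equals eps / n
print("Q1 before/after:", q1_before, q1_after)    # unchanged
print("Q3 before/after:", q3_before, q3_after)    # unchanged
```

Because only the most extreme observation moves, the order of the remaining values is preserved, so the quartiles stay where they were while the mean ends up far above the third quartile.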
Influence functions study how such changes in data values create changes in statistical summaries of those values.