Should the mean be used when data are skewed?

Question

Introductory applied statistics texts often distinguish the mean from the median (often in the context of descriptive statistics, when motivating the summarization of central tendency using the mean, median and mode) by explaining that the mean is sensitive to outliers in sample data and/or to skewed population distributions. This is then used to justify the assertion that the median is to be preferred when the data are not symmetrical.

For example:

"The best measure of central tendency for a given set of data often depends on the way in which the values are distributed.... When data are not symmetric, the median is often the best measure of central tendency. Because the mean is sensitive to extreme observations, it is pulled in the direction of the outlying data values, and as a result might end up excessively inflated or excessively deflated."
—Pagano and Gauvreau, (2000) Principles of Biostatistics, 2nd ed. (P&G were at hand, BTW, not singling them out per se.)

The authors define "central tendency" thus: "The most commonly investigated characteristic of a set of data is its center, or the point about which observations tend to cluster."

This strikes me as a less-than-forthright way of saying "only use the median, period," because only using the mean when the data/distributions are symmetrical is the same thing as saying only use the mean when it equals the median. Edit: whuber rightly points out that I am conflating robust measures of central tendency with the median. So it is important to keep in mind that I am discussing the specific framing of the arithmetic mean versus the median in introductory applied statistics (where, mode aside, other measures of central tendency are not motivated).

Rather than judging the utility of the mean by how much it departs from the behavior of the median, ought we not simply understand these as two different measures of centrality? In other words being sensitive to skewness is a feature of the mean. One could just as validly argue "well the median is no good because it is largely insensitive to skewness, so only use it when it equals the mean."
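
To make the contrast concrete, here is a minimal sketch using made-up numbers (they are not taken from Pagano and Gauvreau or any real dataset). It computes both measures for a small symmetric sample and for the same sample with one extreme value appended.

```python
from statistics import mean, median

# Hypothetical values chosen purely for illustration.
symmetric = [2, 3, 3, 4, 4, 4, 5, 5, 6]   # symmetric about 4
with_outlier = symmetric + [40]           # same data plus one extreme value

for label, data in [("symmetric", symmetric), ("with outlier", with_outlier)]:
    # Report both summaries side by side for each dataset.
    print(f"{label:>12}: mean = {mean(data):.2f}, median = {median(data):.2f}")
```

With these numbers the mean moves from 4 to 7.6 while the median stays at 4. Whether that movement is a defect or exactly the information wanted depends on the question being asked.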

(The mode is quite sensibly not getting involved with this question.)

Personally, I like to include both measures, mean and median, which will give the reader not only some information about central tendency, but also an idea of how skewed the data are. — bdeonovic, May 04 '14 at 21:18
Surely, and adding in the numerical and visual elements of a box, dot and whisker plots moves even a little closer to presenting a distribution itself. — Alexis, May 04 '14 at 21:44
Some context and clarification would improve this question. (1) *In what context* do these (hypothetical) intro texts assert the mean is to be preferred, and for what purpose? (2) Exactly how are these texts "judging the utility of the mean by how much it departs from the behavior of the median"? Could you provide an example or a quotation so we can better understand? — whuber, May 05 '14 at 14:40
In light of the quotation (thank you for providing it!), I do not understand your criticism. The quotation seems clear enough and could hardly be validly termed "less-than-forthright." I would agree that it does need to be interpreted with a little generosity in the sense that "not symmetric" should be taken as "depart from symmetry by an amount that could be important" rather than in the purely mathematical sense (in which almost no dataset is symmetric!). With this *proviso*, the authors are careful to explain *why* one might elect to use a median to describe the center of a dataset. — whuber, May 05 '14 at 18:59
I am afraid I disagree. It seems to me (and I might be misunderstanding) that when they write "Because the mean is sensitive to extreme observations" they mean *precisely* because the mean is not the median (i.e. because it is not robust to outliers). If (1) the mean should only be used when the distribution is symmetric, and (2) when the distribution is symmetric the mean $\approx$ median, that implies (3) only use the median. But they don't come right out and say "don't use the mean" hence "less than forthright." — Alexis, May 05 '14 at 19:09
At one point you misinterpret: the median is not the only statistic that is robust to a few extreme observations. Thus the mean is indicted on the basis of an (often) undesirable characteristic and not by any comparison to the median. But I also get a glimmer of your concern, and perhaps it is related to the implicit conflation of asymmetry and existence of outliers that occurs in this quotation. That is regrettably ill-conceived, because although having outliers sometimes implies asymmetry, the converse is not often true. — whuber, May 05 '14 at 19:15
Insightful! I had not considered the fine point about the distinction between outliers in a sample of a symmetric distribution, and skewness of a (population) distribution. That is something to chew on. — Alexis, May 05 '14 at 19:19
Readers here will find the following thread of interest: [If the mean is so sensitive, why use it in the first place?](http://stats.stackexchange.com/q/14210/7290) — gung - Reinstate Monica, May 05 '14 at 19:56
@Alexis Could you please provide the book's own definition of "central tendency?" This seems critical in assessing the statement about the "best measure of central tendency." — jsk, May 05 '14 at 20:24
In light of the definition given for "central tendency", it seems clear why the mean would not be a useful measure in the presence of skew or outliers. Whether or not you really want to estimate this notion of central tendency seems to be another matter! — jsk, May 05 '14 at 20:58
I don't buy that line of argument: it begs the question of why the median is the preferred definition of central tendency. — Alexis, May 06 '14 at 22:13

Glen_b · Accepted Answer · 2020-05-31T11:11:05.863

I disagree with the advice as a flat out rule. (It's not common to all books.)

The issues are more subtle.

If you're actually interested in making inference about the population mean, the sample mean is at least an unbiased estimator of it, and has a number of other advantages. In fact, see the Gauss-Markov theorem - it's best linear unbiased.

If your variables are heavily skewed, the problem comes with 'linear': in some situations all linear estimators may be bad, so the best of them may still be unattractive. An estimator of the mean that is nonlinear may then be better, but it would require knowing something (or even quite a lot) about the distribution. We don't always have that luxury.
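
As a rough illustration of that trade-off, the sketch below assumes the data really are lognormal (a strong assumption, and the parameter values are hypothetical). It compares the sample mean with a nonlinear plug-in estimator, `exp(m + s^2/2)`, built from the mean `m` and variance `s^2` of the logged data. It is a simulation under stated assumptions, not a general recommendation.

```python
import math
import random
from statistics import fmean, pvariance

random.seed(0)

MU, SIGMA = 0.0, 1.5                      # hypothetical log-scale parameters
N, REPS = 100, 10000                      # sample size and number of simulated samples
TRUE_MEAN = math.exp(MU + SIGMA**2 / 2)   # population mean of a lognormal

def one_replication():
    """Draw one lognormal sample and return both estimates of the population mean."""
    logs = [random.gauss(MU, SIGMA) for _ in range(N)]
    # Linear estimator: ordinary sample mean of the raw (exponentiated) values.
    sample_mean = fmean(math.exp(v) for v in logs)
    # Nonlinear estimator: plug the log-scale estimates into the lognormal mean formula.
    m = fmean(logs)
    plug_in = math.exp(m + pvariance(logs, m) / 2)
    return sample_mean, plug_in

results = [one_replication() for _ in range(REPS)]
mse_sample_mean = fmean((a - TRUE_MEAN) ** 2 for a, _ in results)
mse_plug_in = fmean((b - TRUE_MEAN) ** 2 for _, b in results)

print(f"MSE of sample mean:       {mse_sample_mean:.3f}")
print(f"MSE of lognormal plug-in: {mse_plug_in:.3f}")
```

The plug-in estimator only does well here because the assumed model is correct. If the data are not lognormal it can be badly biased, which is the "knowing something about the distribution" caveat.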

If you're not necessarily interested in inference relating to a population mean ("what's a typical age?", say, or whether there's a more general location shift from one population to another, which might be phrased in terms of any location, or even a test of one variable being stochastically larger than another), then casting that in terms of the population mean is either not necessary or (in the last case) likely counterproductive.

So I think it comes down to thinking about:

what are your actual questions? Is population mean even a good thing to be asking about in this situation?
what is the best way to answer the question given the situation (skewness in this case)? Is using sample means the best approach to answering our questions of interest?

It may be that you have questions not directly about population means, but nevertheless sample means are a good way to look at those questions (for example, the population median of a waiting time that you assume to be distributed as an exponential random variable is better estimated as a particular fraction, namely $\ln 2 \approx 0.693$, of the sample mean) ... or vice versa: the question might be about population means but sample means might not be the best way to answer that question.
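
The exponential example can be checked with a small simulation. This sketch assumes the waiting times are exactly exponential, with a hypothetical rate, and compares two estimators of the population median: the sample median, and `ln(2)` times the sample mean.

```python
import math
import random
from statistics import fmean, median

random.seed(0)

RATE = 0.5                          # hypothetical rate; population mean is 1 / RATE
N, REPS = 50, 10000                 # sample size and number of simulated samples
TRUE_MEDIAN = math.log(2) / RATE    # median of an exponential distribution

sq_err_median = []
sq_err_scaled_mean = []
for _ in range(REPS):
    waits = [random.expovariate(RATE) for _ in range(N)]
    # Estimator 1: the sample median itself.
    sq_err_median.append((median(waits) - TRUE_MEDIAN) ** 2)
    # Estimator 2: ln(2) times the sample mean, which relies on the exponential assumption.
    sq_err_scaled_mean.append((math.log(2) * fmean(waits) - TRUE_MEDIAN) ** 2)

mse_median = fmean(sq_err_median)
mse_scaled_mean = fmean(sq_err_scaled_mean)
print(f"MSE, sample median:        {mse_median:.4f}")
print(f"MSE, ln(2) * sample mean:  {mse_scaled_mean:.4f}")
```

The advantage of the scaled mean depends entirely on the exponential assumption holding. Under a different waiting-time distribution the factor `ln(2)` would be wrong.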

score 15 · Answer 2 · edited Sep 28 '15 at 21:54

In real life, we should choose a measure of central tendency based on what we are trying to find out; and yes, sometimes the mode is the right thing to use. Sometimes it's the Winsorized or trimmed mean. Sometimes the geometric or harmonic mean. Sometimes there is no good measure of central tendency.

Intro books are written badly: they teach that there are cookbook rules to apply.

Take income. This is often very skewed and sometimes has outliers; sure enough, we usually see "median income" reported. But sometimes the outliers and skewness are important. It depends on context and requires thought.

I wrote more on this

edited Sep 28 '15 at 21:54 by k-dubs
answered May 04 '14 at 20:44 by Peter Flom

Peter, thank you so much for the link to your post. I wish that the intro texts took the 1 to 2 pages of space necessary to provide as thoughtful a consideration as you provided there. – Alexis May 04 '14 at 20:51
I haven't written one but I want to insert a little defence of introductory texts. Any introductory text that tried to give a fully nuanced view that experienced professionals would recognise as such would be flamed by almost all intended recipients; indeed it would not even get published. – Nick Cox May 05 '14 at 17:23
A substantive comment: when values are additive such that totals make (e.g.) physical sense, the mean is a natural summary regardless of the distribution of the individual values. – Nick Cox May 05 '14 at 17:25
Nick, this is precisely what I was thinking. For example, if there are resources or risks attached to, say, age, then policy makers might be interested in per-capita costs, which are one way of expressing total (additive) cost. Of course, I do not mean to malign the median and substantive interpretation of it either. – Alexis May 05 '14 at 19:16
@NickCox I think that introductory texts can do a lot better than they do. For mean vs. median it's not even a mathematical argument - it's a substantive one. Introductory texts need to tell the person reading them that they are not really qualified to do data analysis. – Peter Flom May 05 '14 at 19:40
@PeterFlom Would you make the same argument about introductory textbooks in other disciplines? If not, then what makes intro stats texts different? – jsk May 05 '14 at 20:47
Sure! After "Intro to Clinical Psychology" no one is ready to be a therapist. After one semester of med school - not ready to be a surgeon. Had a year of engineering? Don't design a building yet. Etc. – Peter Flom May 05 '14 at 21:30
@PeterFlom It seems I was not clear. Does an intro to clinical psychology book need to tell someone they will not be ready to be a therapist after reading the book? Does an intro to medicine book need to tell someone they will not be ready to be a surgeon? Does an intro to engineering book need to say they will not be ready to design a building? I would assume this is obvious to people in other fields without being told explicitly. Why do stats students need to be told explicitly? What message do you think intro stats texts send to the reader? – jsk May 07 '14 at 08:56
@jsk. Oh, OK. I think they need to be told explicitly in statistics because many people seem to think they are ready after one course in data analysis; indeed, in many fields (psychology, sociology, medicine, etc) people are expected to do data analysis after only 1, 2, or sometimes 3 courses. In PhD programs, for instance, they are expected to write dissertations. Why is it more obvious in other fields? I am not sure. – Peter Flom May 07 '14 at 12:01
@PeterFlom I think one reason it is less obvious is that it's a soft skill, not hard skill (like building a bridge or doing surgery), and secondly there are less obvious wrong answers. But I can sympathize a bit with your pov to make it more clear that 1 course of stats 101 does not make you a data analyst. However, it's also an instructor responsibility to get this across (in a subtle way). Which leads me to my question: why didn't you tell your readers on your blog post exactly that? – Georg M. Goerg Mar 21 '16 at 11:37
@PeterFlom Btw, your blog post is wrong: the median in your example of income is also $100, just like the mode (if I understand your income table correctly; if I didn't, I apologize, but then I would ask you to clarify it a bit better). – Georg M. Goerg Apr 26 '16 at 11:32
@georg Ooops. That's true. But, unfortunately, Yahoo Voices is closed and I can't edit it. – Peter Flom Apr 27 '16 at 11:41

score 9 · Answer 3 · answered May 07 '14 at 07:13

Even when data are skewed (e.g., health care costs calculated alongside a clinical trial, where a few patients totalled zero cost because they died just after enrollment, and a few patients accrued tons of cost due to side effects of the health care programme under investigation), the mean may be preferred to the median for at least one practical reason: multiplying the mean cost by the number of patients gives health care decision-makers the budget impact of the health care technology under study.
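
The arithmetic behind this is simple enough to sketch. The per-patient costs below are invented for illustration (no real trial is being quoted): two zero-cost patients, one very expensive one, and the rest in between.

```python
import math
from statistics import mean, median

# Hypothetical per-patient costs, heavily right-skewed.
costs = [0, 0, 1200, 1500, 1800, 2000, 2500, 40000]

n = len(costs)
total = sum(costs)            # the actual budget impact for these patients
mean_cost = mean(costs)
median_cost = median(costs)

print(f"total cost:        {total}")
print(f"mean   x patients: {mean_cost * n}")     # recovers the total exactly
print(f"median x patients: {median_cost * n}")   # does not recover the total
```

Here the mean times the number of patients gives back the total of 49000, whereas the median times the number of patients gives 13200, which understates the budget by a wide margin.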

answered May 07 '14 at 07:13 by Carlo Lazzaro

Echoing Carlo's comment: if you are interested in a population total (e.g., in audit sampling), then you are interested in the mean, period. It makes no difference how skewed or outlier-prone the distribution is; you just have to deal with it. You can't Winsorize, trim, otherwise remove outliers, or log transform. Stratification can help greatly; in the case of extreme outliers, those should be made strata unto themselves. – BigBendRegion Oct 16 '18 at 11:21

score 3 · Answer 4 · answered May 05 '14 at 04:37

I think that what's missing from the question, as well as from both answers so far, is that the discussion of mean vs. median in introductory statistics books generally occurs early on, in a chapter about how to numerically summarize a distribution. As opposed to inferential statistics, this is generally about producing descriptive statistics that convey information about the distribution of the data numerically rather than graphically. One context in which this arises is the descriptive statistics section of a report or journal article, where there generally is not room for graphical summaries of all the variables in your dataset. If the distribution is skewed, it seems sensible in this context to choose the median over the mean. If the distribution is symmetric without outliers, then the mean is generally preferred over the median as it will be a more efficient estimator.
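
The efficiency claim can be illustrated for one specific symmetric case. The sketch below assumes standard normal data (the sample size and replication count are arbitrary choices) and compares the sampling variance of the mean with that of the median.

```python
import random
from statistics import fmean, median, pvariance

random.seed(0)

N, REPS = 51, 10000     # arbitrary sample size and number of simulated samples

means, medians = [], []
for _ in range(REPS):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    means.append(fmean(sample))
    medians.append(median(sample))

# Ratio above 1 means the median varies more from sample to sample than the mean.
ratio = pvariance(medians) / pvariance(means)
print(f"var(median) / var(mean) = {ratio:.2f}")
```

For normal data this ratio approaches pi/2 (about 1.57) in large samples, so the mean is the more efficient of the two. That ordering is specific to the assumed distribution: for symmetric distributions with heavier tails, such as the Laplace, the median can be the more efficient one.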

answered May 05 '14 at 04:37 by jsk

Your point about descriptive versus inferential statistics is worthwhile. But you are effectively saying (for descriptive statistics) "only use the mean when it is the same as the median." If the distribution is skewed, then the median does a poor job of representing the concept of *per capita*, right? So isn't it just as valid to take the position "only use the median when it equals the mean?" That's just as arbitrary, and seems to direct attention away from the substantive meaning of these measures (for folks learning them). – Alexis May 05 '14 at 05:23
@Alexis Though the sample mean and sample median are both estimating the same aspect of a symmetric population, they will rarely be equal for a dataset that is approximately symmetric. But yes, for all practical purposes they should be similar, in which case it really doesn't matter which one you report. In choosing a numerical summary measure of the distribution to replace a graphical display, the goal is not to represent the concept of per capita. In other contexts, I agree with you and the others that the appropriate measure of center should depend on the research question. – jsk May 05 '14 at 07:00
The goal is not to represent the concept of per capita? Says who? Why presuppose that's not the goal? – Alexis May 05 '14 at 12:48
@Alexis Perhaps we are thinking about things differently, but I don't think that's any reason to respond rudely. Why don't you try giving an example to illustrate your point instead of acting argumentative and shocked that I could espouse such a view. – jsk May 05 '14 at 16:29
Apologies @jsk, no rudeness or shock intended. "Says who? Why presuppose that's not the goal?" are really the heart of what I am trying to get at: there may surely be situations where per capita is irrelevant, just as you say. But isn't the opposite also true? Please accept my apology for any offense; it was not intended. – Alexis May 05 '14 at 16:57
I don't see any rudeness or "acting shocked" coming from the OP...just sayin'... – Nick Stauner May 05 '14 at 17:53
I don't see that it matters whether you are doing inferential or descriptive statistics in this instance. If the appropriate descriptive measure of central tendency is the median, then inferences should be drawn about the median; if the mean, then the mean. If no descriptive measure makes sense, then no inferential measure will make sense either. – Peter Flom May 05 '14 at 18:39
@PeterFlom What about in cases where the end goal is not inference? I agree that the appropriateness of a descriptive statistic depends entirely on the reason for producing the statistic. The notion that it is possible that "no descriptive measure makes sense" seems to imply that a descriptive statistic cannot be inherently meaningful. I would argue that in almost all cases, the median makes sense as a measure of the center of the distribution by definition. Whether or not it makes sense for other purposes is another question. – jsk May 05 '14 at 20:14
