0

Are there any general guidelines on whether the mean or the median better represents the underlying data? I think the median is always better, especially when the standard deviation of the data set is large.

Your advice is appreciated.

Lin Ma
  • 2
    Neither the mean nor the median is always better than the other. The median is a more robust estimator (the median has a breakdown point of 50%, while the mean has a breakdown point of 0%), which is important if you have outliers or data that's not continuous. However, there are also (many) cases where you would want to use the mean. A simple example: if I'm interested in how far I drove today and I know how many hours I drove, I would multiply my average speed by the hours; the median speed would not be as helpful. Not to mention, if you're analyzing Normally distributed data, use the mean – RAND Aug 12 '16 at 17:23
  • @RMontgomery, thanks, and upvoted. I am confused by what you mean by "the mean has breakdown of 0%". I think the mean is not as accurate as the median, which has exactly 50% breakdown, but shouldn't it be somewhere in the middle rather than at 0%? :) – Lin Ma Aug 12 '16 at 17:55
  • @RMontgomery, another thought: if I plot both the mean and the median for a discrete data set and they differ a lot, does that mean there are outliers? Thanks. – Lin Ma Aug 12 '16 at 17:56
  • 1
    The breakdown point can be thought of (intuitively) as the proportion of your data that can be incorrect before it significantly impacts your estimator. If I have 100 data points and one of them was corrupted and entered as infinity, then my mean would be infinity. Thus the breakdown point of the arithmetic mean is 0. The median would need 50% of its data corrupted before the estimator could be pushed arbitrarily far, thus it has a breakdown point of 50%. – RAND Aug 12 '16 at 18:01
  • 1
    @LinMa As to your other question, you would need to quantify "a lot", but I suppose that wouldn't be a bad conjecture. If the mean and median differ significantly, there has to be at least one observation pulling the mean away from the median. – RAND Aug 12 '16 at 18:07
  • 1
    Also see http://stats.stackexchange.com/questions/2547 – whuber Aug 12 '16 at 18:25
  • @whuber, thanks for the reference, and upvoted. From your reply, I take it that the median is always better than the mean? If my understanding of the post is wrong, it would be great if you could show when using the mean is better than the median. – Lin Ma Aug 14 '16 at 22:10
  • @RMontgomery, thanks for your comments; I upvoted all of your replies. Sorry for the late response. After reading your analysis and examples, if my purpose is to analyze a data distribution, I think the median is always better than the mean? The only case I can find where the mean is probably good is when we need an arithmetic quantity (as in your driving-distance example). If there are examples where the mean is better than the median for analyzing a data distribution, I would appreciate it if you could share them and correct me. Thanks. – Lin Ma Aug 14 '16 at 22:13
  • 1
    You must have misunderstood, because I would not venture to make such a blanket statement as "this statistic is always better than that one." It depends on the data and on how the statistic will be used. I would like to suggest that instead of asking "which is better," you focus on "how can I recognize when to use each one." – whuber Aug 14 '16 at 22:50
  • @whuber, thanks for the correction. Is there an example where the mean is better than the median from some perspective? – Lin Ma Aug 14 '16 at 23:07
  • 2
    I see examples in the answer Matthew Gunn posted in this thread! Here's another: suppose you had a set of quantities of a long-lasting drug administered in widely varying doses to a medical patient and your purpose was to estimate how much of that drug they had consumed on average. The median could be misleading whereas the mean is directly related to what you want to know. Here's another: you would like to estimate a typical salary earned by a writer during their career. Most years she earned nothing, but in a few years earned millions from best-sellers. The median (zero) would be misleading. – whuber Aug 15 '16 at 14:27
  • 1
    @LinMa In my opinion, your assertion that the median is always better than the mean is absolutely incorrect. Matthew Gunn has already given two perfectly reasonable examples where the mean is a better estimator than the median. As a general piece of advice, you will probably never come across an estimator or statistical technique that is superior to all others in every context. Take, for example, the least squares estimators for simple linear regression: they are the best linear unbiased estimators, but they can be outperformed by biased estimators. Don't always use one estimator or technique; do what is applicable to your data. – RAND Aug 15 '16 at 14:29
  • @whuber, I love your second example about the writer's salary. But your first example, about the long-lasting drug, does not give me a strong reason why the mean is better. Maybe I misread it; would you mind elaborating a bit more on the first example? Upvoted for your patience in replying. – Lin Ma Aug 16 '16 at 03:57
  • @RMontgomery, thanks for the answer, and upvoted. I withdraw my conclusion that the median is always better than the mean. But I think I can draw another conclusion: the difference between the median and the mean describes how far the data set is from a uniform distribution. Is that conclusion correct? :) – Lin Ma Aug 16 '16 at 04:01
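
The breakdown-point comments above can be checked numerically. The following is a minimal sketch, not code from the thread: the helper `corrupt`, the stand-in value `1e9` for a "corrupted" entry, and the sample 1..100 are all assumptions chosen for illustration.

```python
import statistics


def corrupt(data, n_bad, bad_value=1e9):
    """Return a copy of data with the first n_bad entries replaced by bad_value.

    bad_value is a hypothetical stand-in for a wildly wrong entry.
    """
    return [bad_value] * n_bad + list(data[n_bad:])


# Hypothetical clean sample: the values 1.0 through 100.0.
clean = [float(x) for x in range(1, 101)]

# Corrupt 0, 1, 49 and 50 of the 100 points and compare the two estimators.
for n_bad in (0, 1, 49, 50):
    sample = corrupt(clean, n_bad)
    print(n_bad, statistics.mean(sample), statistics.median(sample))
```

In this sketch a single bad value is enough to drag the mean far outside the range of the genuine data, while the median stays within that range until half of the sample has been corrupted.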

1 Answer

6

To give a counter-example, you almost certainly want the mean and not the median when calculating returns in finance.

Examples where the median is horribly misleading compared to the mean:

  • If you're looking at a set of bond returns, the median will effectively ignore those observations where the bond defaults and you lose half or more of your money.
  • If you're looking at venture capital returns, it's in some sense the reverse. The median company in VC or angel investing is a bust, and the median will effectively ignore the big winners like Google. The return for Ron Conway's first angel fund came largely from one company, Google.

Sometimes insensitivity to outliers is NOT what you want!

Good luck explaining to investors, "I know our fund is down 40 percent this year because nearly half our bonds went bust with no recovery, but our median bond is returning one percent!"
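
To put rough numbers on that quote, here is a minimal sketch with a hypothetical fund (the ten bonds, the equal weighting, and the exact returns are assumptions for illustration, not data from this answer). With equal weights, the fund's return is exactly the mean of the individual bond returns, which is why the mean is the quantity investors feel.

```python
import statistics

# Hypothetical fund: 10 equally weighted bonds.
# 4 default with no recovery (-100%), 6 return +1%.
bond_returns_pct = [-100.0] * 4 + [1.0] * 6

# With equal weights, the portfolio return is the mean of the bond returns.
fund_return_pct = statistics.mean(bond_returns_pct)

# The median only reports what the "middle" bond did.
median_bond_pct = statistics.median(bond_returns_pct)

print(f"fund return: {fund_return_pct:.1f}%")
print(f"median bond: {median_bond_pct:.1f}%")
```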

Matthew Gunn
  • Thanks Matt for the advice, and upvoted. I am not a finance expert and I trust everything you mentioned. What I actually want is to plot kids' age against weight, to see the trend and how strongly the two are correlated. For each age (an integer) there are 10-20 kids, and I want to draw only one point per age for all kids; I think using the median is better? If you could advise on my specific use case, that would be great. – Lin Ma Aug 12 '16 at 17:49
  • 1
    @LinMa In a kind of ambiguous setting where it's unclear what to use, why not calculate both? Use ordinary least squares regression to estimate a conditional expectation function (i.e. a linear function for the mean) and use quantile regression to estimate a linear function for quantiles (e.g. the median, the 80th percentile, etc.). – Matthew Gunn Aug 12 '16 at 17:56
  • Thanks Matt, upvoted your reply. In your fund returns example, why is the mean better than the median? I think the top-performing 10% of funds might be far from the mean, and I do not see what specific value the mean has over the median for finding funds worth investing in. Maybe my understanding is wrong; please feel free to correct me. If you could elaborate a bit more, that would be great. – Lin Ma Aug 14 '16 at 22:14
  • Hi Matt, regarding your example about bonds, "If you're looking at a set of bond returns, the median will effectively ignore those observations where the bond defaults and you lose half or more of your money.", do you mean that a few bonds have good returns while the rest do not? I am a bit confused since I do not invest in bonds. But I have heard that returns are very similar across bonds, in which case the median should be the same as the mean. – Lin Ma Aug 16 '16 at 04:05
  • 1
    @LinMa Imagine you have 5 bond returns where 4 bonds return 2% and 1 defaults with no recovery. The returns are {-100%, 2%, 2%, 2%, 2%}. The median is 2%. The mean is -18.4%. – Matthew Gunn Aug 16 '16 at 12:26
  • Thanks Matt for your patience in explaining, and upvoted. I do not have your finance knowledge and was not aware that bonds could lose money until you reminded me. :) In your example, I think the median makes more sense for selecting good bonds, correct? But why did you say the mean is better? – Lin Ma Aug 17 '16 at 07:16
  • 1
    @LinMa No. Imagine asset A has an equal probability of returning {-100%, 4.1%, or 4.2%}. Imagine asset B returns 4% with certainty. The median return of Asset A is 4.1%. The median return of Asset B is 4%. Which one do you want? – Matthew Gunn Aug 17 '16 at 15:07
  • Thanks Matt, upvoted. So the median will ignore some extreme cases, and that is its disadvantage? – Lin Ma Aug 18 '16 at 05:30
  • Thanks for all the help, Matt. I have marked your reply as the answer. – Lin Ma Aug 28 '16 at 22:44
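
The two numerical examples in the comments above can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names and the helper `summarize` are hypothetical, and treating asset A's three outcomes as equally likely follows the comment's own wording.

```python
import statistics


def summarize(returns_pct):
    """Return (mean, median) of a list of percentage returns."""
    return statistics.mean(returns_pct), statistics.median(returns_pct)


# Five bonds: four return 2%, one defaults with no recovery.
five_bonds = [-100.0, 2.0, 2.0, 2.0, 2.0]

# Asset A: three equally likely outcomes, so the plain mean is its expected return.
asset_a = [-100.0, 4.1, 4.2]

# Asset B: returns 4% with certainty.
asset_b = [4.0]

for name, returns in [("five bonds", five_bonds), ("asset A", asset_a), ("asset B", asset_b)]:
    mean_pct, median_pct = summarize(returns)
    print(f"{name}: mean = {mean_pct:.2f}%, median = {median_pct:.2f}%")
```

Ranking by median prefers asset A over asset B, while ranking by mean (the expected return) prefers asset B, because only the mean accounts for the one-in-three chance of losing everything.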