Background: I am working with real measurements that likely contain two sources of error: (1) measurements that were performed incorrectly, and (2) natural variability of both the measured quantity and the sensors, since different units of the same instrument were used to make the measurements. The real distribution is not necessarily normal, though I expect it to have a single peak, and I can't discard outliers because I can't consistently identify them. (I don't have duplicated values, so I would estimate the mode as the peak of the distribution of measured values.)

I want to find the typical value of the measured quantity. In the past I've had good results using the median, but a colleague asked why I would not use the mode instead. Is the mode more suitable than the median for noisy, possibly skewed data?

KAE
  • Do you have duplicated values? – Dave Jan 13 '20 at 16:17
  • The answer could be either, neither, or even both are useful. It is context dependent. The right place to start is with a histogram built from as many data points as you can get. That will let you understand the full distribution, and you can go from there. Note you can still estimate modes for a continuous variable by fitting a kernel distribution to the histogram, but how useful they are will depend on the underlying distribution, e.g. imagine a multi-modal distribution with several peaks of similar height. – Robert Alan Greevy Jr PhD Jan 13 '20 at 16:26
  • @Dave, there are no repeated values, so I would have to estimate the mode from the distribution peak as Robert mentions. I edited my question to indicate this. – KAE Jan 13 '20 at 18:27
  • @RobertAlanGreevyJrPhD, Let's assume the distribution has 1 peak and is skewed, so it is not normal. I edited the question to indicate this. Is there any reason I would prefer to use the mode rather than the median to estimate the typical value? The mode would be less sensitive to the long tail, right? So that could be one reason. – KAE Jan 13 '20 at 18:31
  • How do you determine the "distribution peak," then? Usually it requires some kind of density estimate that is bandwidth-dependent, as illustrated at https://stats.stackexchange.com/a/428083/919. – whuber Jan 13 '20 at 18:32
  • @whuber - Let's say the distribution is fairly well behaved, so it has one peak but a long tail, like the density plots [here](https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution/). I would get the mode by fitting some sensible function to the distribution and finding the peak of the function. My overall question is, for this kind of skewed distribution, is the mode better than the median at obtaining a 'typical value'? – KAE Jan 13 '20 at 18:40
  • The answer depends on (a) the actual data-generation process; (b) what you mean by "sensible function;" (c) how you fit it; and (d) the sample size. That's why we're probing for details: there's no generic solution or universally correct answer. – whuber Jan 13 '20 at 18:43
  • It may help to consider two hypothetical scenarios. ... – Robert Alan Greevy Jr PhD Jan 13 '20 at 19:54
  • Scenario 1) Let the natural variability and sensor error create a fairly symmetric, single modal distribution with a small variance -- like a Normal distribution covering a contextually tight range. At this point the mode=median. Then let the incorrect measurement process randomly take 10% of the measurements and shift them dramatically to the right -- shifting all of them to the right of the median. Now the mode is in the same spot but the median has shifted a little to the right. Under this data generating mechanism, you can make a case for the mode as a "typical" value. – Robert Alan Greevy Jr PhD Jan 13 '20 at 19:54
  • Scenario 2) Let the natural variability and the sensor error create a right skewed distribution with a single mode and a large variance. At this point the mode does not equal the median. Then let the incorrect measurement process randomly take 10% of the measurements and shift them dramatically to the right. Again the mode may be stable and the median may shift to the right, but it is questionable whether either measure represents a "typical" value for this distribution. A single number summary may be too simplistic for this setting. – Robert Alan Greevy Jr PhD Jan 13 '20 at 19:54
  • However, observations that are only a bit erroneous and do not show up as outliers can affect the mode more than the median. – Christian Hennig Jan 13 '20 at 23:30

1 Answer

It may help to consider two hypothetical scenarios.

Scenario 1) Let the natural variability and sensor error create a fairly symmetric, unimodal distribution with a small variance -- like a Normal distribution covering a contextually tight range. At this point the mode and the median coincide. Then let the incorrect measurement process randomly take 10% of the measurements and shift them dramatically to the right -- shifting all of them to the right of the median. Now the mode is in the same spot, but the median has shifted a little to the right: it sits at roughly the 56th percentile of the original distribution, because half of all the data must now come from the remaining 90%. Under this data-generating mechanism, you can make a case for the mode as a "typical" value.
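Scenario 1 can be checked with a small simulation. The following is only a sketch under assumed numbers: a Normal(10, 0.5) population, a shift of +5 for the bad measurements, a fixed seed, and a hypothetical helper `kde_mode` that takes the peak of a Gaussian kernel density estimate with a Silverman rule-of-thumb bandwidth. None of these choices come from the question, and a different bandwidth would give a somewhat different mode.

```python
import math
import random
import statistics


def kde_mode(data, grid_size=400):
    """Estimate the mode as the highest point of a Gaussian kernel density estimate."""
    n = len(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    # Silverman's rule-of-thumb bandwidth: an assumed choice, and the
    # estimated mode depends on it.
    h = 0.9 * min(statistics.stdev(data), (q3 - q1) / 1.34) * n ** -0.2
    lo, hi = min(data), max(data)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    # An unnormalised density is enough because only the argmax is needed.
    density = [sum(math.exp(-0.5 * ((g - x) / h) ** 2) for x in data) for g in grid]
    return grid[density.index(max(density))]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 1000
clean = [rng.gauss(10.0, 0.5) for _ in range(n)]  # hypothetical correct measurements

# Scenario 1: a random 10% of the measurements are shifted far to the right.
bad_idx = set(rng.sample(range(n), n // 10))
contaminated = [x + 5.0 if i in bad_idx else x for i, x in enumerate(clean)]

median_clean = statistics.median(clean)
median_contaminated = statistics.median(contaminated)
mode_clean = kde_mode(clean)
mode_contaminated = kde_mode(contaminated)

print(f"median: {median_clean:.3f} -> {median_contaminated:.3f}")
print(f"mode:   {mode_clean:.3f} -> {mode_contaminated:.3f}")
```

The median can only move to the right here, since every shifted point lands above it. The estimated mode should stay near the original peak, although it is itself a noisy estimate, so a single run says little about which statistic has the smaller overall error.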

Scenario 2) Let the natural variability and the sensor error create a right-skewed distribution with a single mode and a large variance. At this point the mode does not equal the median. Then let the incorrect measurement process randomly take 10% of the measurements and shift them dramatically to the right. Again the mode may be stable and the median may shift to the right, but it is questionable whether either measure represents a "typical" value for this distribution. A single-number summary may be too simplistic for this setting.
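Scenario 2 can be sketched the same way. The assumptions here are a lognormal(0, 1) population, a shift of +50 for the bad measurements, and a fixed seed. To sidestep the bandwidth choice raised in the comments, this sketch swaps in a different mode estimator, the half-sample mode, which repeatedly keeps the densest half of the sorted data. That is an illustrative substitution, not the method discussed in the thread.

```python
import random
import statistics


def half_sample_mode(data):
    """Half-sample mode: repeatedly keep the narrowest window holding half the points."""
    xs = sorted(data)
    while len(xs) > 2:
        half = (len(xs) + 1) // 2  # window size, ceil(n / 2)
        # Start index of the narrowest window of `half` consecutive sorted points.
        start = min(range(len(xs) - half + 1),
                    key=lambda i: xs[i + half - 1] - xs[i])
        xs = xs[start:start + half]
    return sum(xs) / len(xs)  # average the one or two points that remain


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 2000
clean = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]  # hypothetical right-skewed data

# Scenario 2: a random 10% of the measurements are shifted far to the right.
bad_idx = set(rng.sample(range(n), n // 10))
contaminated = [x + 50.0 if i in bad_idx else x for i, x in enumerate(clean)]

median_clean = statistics.median(clean)
median_bad = statistics.median(contaminated)
mode_clean = half_sample_mode(clean)
mode_bad = half_sample_mode(contaminated)

print(f"median: {median_clean:.3f} -> {median_bad:.3f}")
print(f"mode:   {mode_clean:.3f} -> {mode_bad:.3f}")
```

For this population the mode and median already disagree before any contamination (the theoretical values are exp(-1) and 1), which is the point of the scenario: the two numbers answer different questions, and neither alone describes the spread.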

  • Excellent, these scenarios make clear why my colleague suggested using the mode instead of median. For my particular dataset, where Scenario #1 is likely, the mode is more appropriate than the median. Many thanks to all for this helpful discussion – KAE Jan 14 '20 at 12:54