90

What do you call an average that does not include outliers?

For example if you have a set:

{90,89,92,91,5} avg = 73.4

but excluding the outlier (5) we have

{90,89,92,91(,5)} avg = 90.5

How do you describe this average in statistics?

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
Tawani
  • 1,003
  • 1
  • 7
  • 5
  • https://sciencing.com/calculate-outliers-5201412.html I felt the above link surely has answered the question. – Sam Feb 20 '18 at 08:20
  • 3
    This depends how the assumed outliers are defined. It could be a trimmed mean or a Winsorized mean or some other form of robust estimate of location. – Michael R. Chernick Feb 20 '18 at 08:47
  • When I saw the title of this question, I was hoping to find a punchline here.... – ashleedawg Mar 02 '20 at 22:34

15 Answers

76

It's called the trimmed mean. Basically what you do is compute the mean of the middle 80% of your data, ignoring the top and bottom 10%. Of course, these numbers can vary, but that's the general idea.
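
A minimal sketch of the idea in Python, assuming plain lists and a symmetric trim (note that it drops a value from each tail, which is the point discussed in the comments below):

def trimmed_mean(data, proportion=0.1):
    """Mean of the data after dropping `proportion` of the values from each tail."""
    xs = sorted(data)
    k = int(len(xs) * proportion)             # number of values dropped per tail
    trimmed = xs[k:len(xs) - k] if k else xs
    return sum(trimmed) / len(trimmed)

print(trimmed_mean([90, 89, 92, 91, 5], proportion=0.2))  # drops 5 and 92, giving 90.0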

ashleedawg
  • 103
  • 4
dsimcha
  • 7,375
  • 7
  • 32
  • 29
  • 13
    Using a rule like "biggest 10%" doesn't make sense. What if there are no outliers? The 10% rule would eliminate some data anyway. Unacceptable. –  Feb 02 '09 at 14:45
  • 4
    See my answer for a statistically-significant way to decide which data qualify as an "outlier." –  Feb 02 '09 at 14:46
  • 7
    Well, there's no rigorous definition of outlier. As for your response, if there are outliers they will affect your estimate of the standard deviation. Furthermore, standard deviation can be a bad measure of dispersion for non-normally distributed data. – dsimcha Feb 02 '09 at 14:47
  • True there's no rigorous definition, but eliminating based on percentile is certainly wrong in many common cases, including the example given in the question. –  Feb 02 '09 at 14:50
  • Also, outliers will not affect standard deviation much. Unless there are many of them, in which case they aren't outliers! You might for example have a bi-modal or linearly random distribution, but then throwing out data is wrong, and indeed the notion of "average" might be wrong. –  Feb 02 '09 at 14:51
  • 1
    The trimmed mean also enjoys the benefit of including the median as a limiting case, ie, when you trim 50% of data on both sides. – Andrew M Dec 03 '14 at 16:55
  • 1
    **This answer is incorrect:** since only one (low) value was discarded, the result has not been "trimmed," which by definition removes equal numbers of values at both ends of the data distribution. – whuber Feb 20 '15 at 14:39
  • 1
    @whuber Not so. The literature certainly includes trimmed means where the proportions are unequal in each tail, including the case of zero in one tail. Examples are prominent in http://onlinelibrary.wiley.com/book/10.1002/9781118165485 It is a reasonable convention to use equal proportions (a) wherever distributions are approximately symmetric (b) in the absence of a rationale for doing otherwise, but that is not the only possible definition of a trimmed mean. Clearly analysis and interpretation need to account for any differences in proportions. – Nick Cox Jan 08 '16 at 13:17
  • 2
    @Nick Thank you for the clarification. I would go further, though, and suggest that unless that one "outlier" was excluded due to considerations that (a) were independent of the observed distribution of the data and (b) *a priori* suggested 20% trimming of the low end, then it would be misleading to characterize the process in the question as a "trimming" procedure. It looks like outlier detection and rejection, pure and simple. Although the *result* may look the same, as *statistical procedures* the two processes of trimming and outlier removal are very different. – whuber Jan 08 '16 at 14:24
  • 2
    @whuber I agree; personally I wouldn't use _trimming_ to describe what is in effect an outlier removal approach based on some other criterion, including visceral guesses. But the distinction is in the mind of the beholder: there is a difference between "for data like this, trimming 5% in each tail seems a good idea" and "I've looked at the data and the top 5% are probably best ignored", etc. The formulas don't know the analyst's attitudes, but the latter are the researcher's justification for what is done. – Nick Cox Jan 08 '16 at 14:32
  • 1
    The trimming here was one-sided. If you trimmed from both the top and the bottom, you would remove 92 as well, cutting out 40% of the distribution. – Michael R. Chernick Dec 24 '16 at 15:00
45

A statistically sensible approach is to use a standard deviation cut-off.

For example, remove any results that lie more than 3 standard deviations from the mean.

Using a rule like "biggest 10%" doesn't make sense. What if there are no outliers? The 10% rule would eliminate some data anyway. Unacceptable.
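
A rough sketch of such a cut-off in Python (an illustration only, using the population SD; note that on the question's tiny set the value 5 lies only about two SDs from the mean, which is the objection raised in the comments below):

def sd_filtered_mean(data, k=3.0):
    """Mean of the values lying within k standard deviations of the overall mean."""
    n = len(data)
    m = sum(data) / n
    sd = (sum((x - m) ** 2 for x in data) / n) ** 0.5   # population SD
    kept = [x for x in data if abs(x - m) <= k * sd]
    return sum(kept) / len(kept)

print(sd_filtered_mean([90, 89, 92, 91, 5]))         # 73.4 -- nothing is removed at 3 SD
print(sd_filtered_mean([90, 89, 92, 91, 5], k=1.5))  # 90.5 -- the 5 is dropped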

  • 2
    I was going to say this approach doesn't work (pathological case = 1000 numbers between -1 and +1, and then a single outlier of value +10000) because an outlier can bias the mean so that none of the results are within 3 stddev of the mean, but it looks like mathematically it *does* work. – Jason S Feb 02 '09 at 15:21
  • It's not at all hard to prove that there has to be at least one data point within one standard deviation (inclusive) of the mean. Any outlier big enough to pull the mean way out is going to enlarge the standard deviation a lot. –  Feb 02 '09 at 17:47
  • 3
    http://en.wikipedia.org/wiki/Chebychev%27s_inequality This applies regardless of the distribution. – dsimcha Feb 02 '09 at 20:49
  • ooh! thanks dsimcha! Chebyshev is one of my math heroes (mostly for function approximations). – Jason S Feb 02 '09 at 21:10
  • 8
    The problem is that "outlier" isn't a post-hoc conclusion about a particular realized data set. It's hard to know what people mean by outlier without knowing what the purpose of their proposed mean statistic is. –  Mar 03 '09 at 20:11
  • 6
    So your categorical statement of "unacceptable" is nonsense, and not really very helpful. The trimmed mean has some useful properties, and some less useful, like any statistic. –  Mar 03 '09 at 20:12
  • @Gregg: I agree with you. Your statement is more accurate than mine. However I still contend that generally it's more useful to depend on dispersion rather than percentile. –  Mar 03 '09 at 22:14
  • 3
    Note that contrary to comments elsewhere in this thread, such a procedure is not associated with statistical significance. – Nick Cox Dec 03 '14 at 16:51
  • For the applications where I've used trimmed mean this approach wouldn't be an improvement. The problem is that outliers can skew the standard deviation and mean significantly, so the range of values included will be biased giving a biased result. Using percentiles to determine the range of data is more robust and consistent from run to run, which is typically what you are looking for when using a trimmed mean. – pavon Sep 24 '21 at 23:25
22

Another standard test for identifying outliers is to use the fences $\mathrm{LQ} - 1.5\times\mathrm{IQR}$ and $\mathrm{UQ} + 1.5\times\mathrm{IQR}$. This is somewhat easier than computing the standard deviation and more general, since it doesn't make any assumptions about the underlying data coming from a normal distribution.
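
A minimal sketch, assuming plain Python lists (quartile conventions differ slightly between packages, so the exact fences can vary):

def iqr_filtered_mean(data, k=1.5):
    """Mean of the values inside [LQ - k*IQR, UQ + k*IQR]."""
    xs = sorted(data)
    n = len(xs)
    def quantile(q):                          # simple linear-interpolation quantile
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
    lq, uq = quantile(0.25), quantile(0.75)
    iqr = uq - lq
    kept = [x for x in xs if lq - k * iqr <= x <= uq + k * iqr]
    return sum(kept) / len(kept)

print(iqr_filtered_mean([90, 89, 92, 91, 5]))  # 5 falls below the lower fence, so 90.5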

Nick Cox
  • 48,377
  • 8
  • 110
  • 156
Mark Lavin
  • 321
  • 1
  • 2
  • 1
    But if it doesn't make any assumptions, what is it then based on? It must at least rely on something like a definition of an outlier? –  Mar 26 '16 at 10:15
  • the formula is quartile-based, so it depends on the median rather than the mean – arahant Jan 13 '19 at 05:28
  • The 1.5 multiplier raises a question: why 1.5? It is apparently somewhat based on the normal distribution. If you apply it directly to a Gaussian distribution, you get 0.675σ + 1.5 * (0.675 - [-0.675])σ = 0.675σ + 1.5 * 1.35σ = 2.7σ, which is an acceptable range to mark as "outliers". Reference: https://medium.com/mytake/why-1-5-in-iqr-method-of-outlier-detection-5d07fdc82097 – Munawwar Apr 26 '20 at 19:13
20

The "average" you're talking about is actually called the "mean".

It's not exactly answering your question, but a different statistic which is not affected by outliers is the median, that is, the middle number.

{90,89,92,91,5} mean: 73.4
{90,89,92,91,5} median: 90

This might be useful to you, I dunno.
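
For reference, the same numbers with Python's standard library (the statistics module provides both):

import statistics

data = [90, 89, 92, 91, 5]
print(statistics.mean(data))    # 73.4
print(statistics.median(data))  # 90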

  • 3
    You are all missing the point. It has nothing to do with the mean, median, mode, stdev, etc. Consider this: you have {1,1,2,3,2,400}, avg = 68.17, but what we want is {1,1,2,3,2,400} with the 400 excluded, avg = 1.8. What do you call that? –  Feb 02 '09 at 15:41
  • 6
    @Tawani - they are not all missing the point. What you want needs to be defined in generic terms; you cannot go by a single example. Without a general definition, if the 400 were 30, would it still be an outlier? What if it were 14? Or 9? Where do you stop? You need stddevs, ranges, or quartiles to do that. –  Feb 02 '09 at 17:05
19

For a very specific name, you'll need to specify the mechanism for outlier rejection. One general term is "robust".

dsimcha mentions one approach: trimming. Another is clipping: all values outside a known-good range are discarded.
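
A small sketch of the clipping idea, where the known-good range is something you must supply from domain knowledge (the 50-100 range below is just an assumption for illustration):

def clipped_mean(data, lo, hi):
    """Mean of the values that fall inside the known-good interval [lo, hi]."""
    kept = [x for x in data if lo <= x <= hi]
    return sum(kept) / len(kept)

print(clipped_mean([90, 89, 92, 91, 5], lo=50, hi=100))  # 90.5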

9

There is no official name, because there are various mechanisms, such as the Q test, used to get rid of outliers.

Removing outliers is called trimming.

No program I have ever used has an average() with an integrated trim().
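
For what it's worth, as the comment below notes, R's mean() does take a trim argument, and SciPy offers an equivalent; a small example, assuming SciPy is installed:

from scipy.stats import trim_mean

print(trim_mean([90, 89, 92, 91, 5], 0.2))  # trims 20% from each tail, giving 90.0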

  • 5
    `mean()` in R has a trim argument http://stat.ethz.ch/R-manual/R-devel/library/base/html/mean.html – Jeromy Anglim Sep 29 '11 at 11:55
  • 2
    In trimming you don't remove outliers; you just don't include them in the calculation. "Remove" might suggest that points are no longer in the dataset. And you don't remove (or ignore) them because they are outliers; the criterion is (usually) just that they are in some extreme fraction of the data. A value not included in a trimmed mean often is only slightly more (or less) than the highest (lowest) value included. – Nick Cox Dec 03 '14 at 16:48
7

I don't know if it has a name, but you could easily come up with a number of algorithms to reject outliers:

  1. Find all numbers between the 10th and 90th percentiles (do this by sorting then rejecting the first $N/10$ and last $N/10$ numbers) and take the mean value of the remaining values.

  2. Sort the values, then reject high and low values as long as doing so changes the mean/standard deviation by more than $X\%$.

  3. Sort the values, then reject high and low values as long as the values in question are more than $K$ standard deviations from the mean (a sketch of this option follows the list).
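
A rough sketch of option 3 (an illustration only; the choice of $K$ and of the stopping rule are assumptions you would need to justify):

def iterative_k_sigma_mean(data, k=2.0):
    """Repeatedly drop the most extreme value while it lies more than
    k standard deviations from the mean of the remaining data."""
    xs = sorted(data)
    while len(xs) > 2:
        m = sum(xs) / len(xs)
        sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
        extreme = max(xs, key=lambda x: abs(x - m))
        if abs(extreme - m) > k * sd:
            xs.remove(extreme)
        else:
            break
    return sum(xs) / len(xs)

print(iterative_k_sigma_mean([90, 89, 92, 91, 5], k=1.5))  # 90.5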

sashkello
  • 2,198
  • 1
  • 20
  • 26
Jason S
  • 255
  • 1
  • 7
4

... {90,89,92,91(,5)} avg = 90.5

How do you describe this average in statistics? ...

There's no special designation for that method. Call it any name you want, provided that you always tell the audience how you arrived at your result, and you have the outliers in hand to show them if they request (and believe me: they will request).

Nick Cox
  • 48,377
  • 8
  • 110
  • 156
4

The most common way of getting a robust average ("robust" being the usual word for resistant to bad data) is to use the median. This is just the middle value in the sorted list (or halfway between the middle two values when the count is even). For your full set {90,89,92,91,5} the median is 90; for the four values left after dropping the 5 it is 90.5, halfway between 90 and 91.

If you want to get really into robust statistics (such as robust estimates of the standard deviation, etc.), I would recommend a lot of the code at the AGORAS group, but this may be too advanced for your purposes.

3

If all you have is one variable (as you imply), I think some of the respondents above are being overly critical of your approach. Certainly other methods that look at things like leverage are more statistically sound; however, that implies you are doing modeling of some sort. If you just have, for example, scores on a test or ages of senior citizens (plausible cases for your example), I think it is practical and reasonable to be suspicious of the outlier you bring up. You could look at the overall mean and the trimmed mean and see how much it changes, but that will be a function of your sample size and the deviation from the mean for your outliers.

With egregious outliers like that, you would certainly want to look into the data-generating process to figure out why that's the case. Is it a data-entry or administrative fluke? If so, and it is likely unrelated to the actual true value (which is unobserved), it seems to me perfectly fine to trim. If it is a true value as far as you can tell, you may not be able to remove it unless you are explicit about it in your analysis.

robin.datadrivers
  • 2,503
  • 11
  • 16
1

I love the discussion here - the trimmed mean is a powerful tool to get a central tendency estimate concentrated around the middle of the data.

The one thing I would add is that there is a choice to be made about which "metric" to use in the cases of small and large sample sizes. In some cases we talk about

  • means in the context of large samples, because of the central limit theorem,
  • medians as robust small-sample alternatives
  • and trimmed means as robust to outliers.

Obviously the above is a gross generalization, but there are interesting papers that discuss families and classes of estimators in large- and small-sample settings and their properties. I work in bioinformatics, where you usually deal with small samples (3 to 10 or so), often in mouse models and the like, and this paper gives a good technical overview of what alternatives exist and what properties these estimators have.

Robust estimation in very small samples

This is of course just one paper, but there are plenty of others that discuss these types of estimators. Hope this helps.

0

disclaimer - this method is ad hoc and without rigorous study. Use at your own risk :)

What I found to be quite good was to reduce the relevance of a point's contribution to the mean by the square of its number of standard deviations from the mean, but only if the point is more than one standard deviation from the mean.

Steps:

  1. Calculate the mean and standard deviation as usual.
  2. Recalculate the mean, but this time, for each value that is more than one standard deviation from the mean, reduce its contribution: divide the value by the square of its number of deviations before adding it to the total. Because it now contributes less, N must also shrink, so subtract 1 - 1/(square of the value's deviations) from N.
  3. Recalculate the standard deviation, but use this new mean rather than the old mean.

example: stddev = 0.5 mean = 10 value = 11

then, deviations = distance from mean / stddev = |10-11|/0.5 = 2

so value changes from 11 to 11/(2)^2 = 11/4

also N changes, it is reduced to N-3/4.

code:

def mean(data):
    """Return the arithmetic mean of data."""
    n = len(data)
    if n < 1:
        raise ValueError('mean requires at least one data point')
    return sum(data) / n

def _ss(data):
    """Return the sum of squared deviations of data, plus its mean."""
    c = mean(data)
    ss = sum((x - c) ** 2 for x in data)
    return ss, c

def stddev(data, ddof=0):
    """Population standard deviation by default; use ddof=1 for the sample SD.
    Returns (standard deviation, mean)."""
    n = len(data)
    if n < 2:
        raise ValueError('variance requires at least two data points')
    ss, c = _ss(data)
    pvar = ss / (n - ddof)
    return pvar ** 0.5, c

def rob_adjusted_mean(values, s, m):
    """Weighted mean: points more than one SD from m are down-weighted by the
    square of their number of SDs; both the total and N shrink accordingly."""
    n = 0.0
    tot = 0.0
    for v in values:
        deviations = abs(v - m) / s
        if deviations > 1:
            # treat as an outlier: weight the value (and its count) by 1/deviations**2
            n += 1.0 / deviations ** 2
            tot += v / deviations ** 2
        else:
            n += 1
            tot += v
    return tot / n

def rob_adjusted_ss(values, s, m):
    """Sum of squared deviations about the adjusted mean, plus that mean."""
    c = rob_adjusted_mean(values, s, m)
    ss = sum((x - c) ** 2 for x in values)
    return ss, c

def rob_adjusted_stddev(data, s, m, ddof=0):
    """Standard deviation recomputed about the adjusted mean.
    Returns (standard deviation, adjusted mean)."""
    n = len(data)
    if n < 2:
        raise ValueError('variance requires at least two data points')
    ss, c = rob_adjusted_ss(data, s, m)
    pvar = ss / (n - ddof)
    return pvar ** 0.5, c

# `values` holds the measurements (the author's 50 readings, not reproduced here).
s, m = stddev(values, ddof=1)
print(s, m)
s, m = rob_adjusted_stddev(values, s, m, ddof=1)
print(s, m)

output before and after adjustment of my 50 measurements:

0.0409789841609 139.04222
0.0425867309757 139.030745443


robert king
  • 101
  • 2
  • 4
    Why might this be better than traditional methods? – Michael R. Chernick Mar 08 '18 at 19:23
  • 3
    Thanks, I'm not familiar w/ this approach. Dividing by the square of a deviation might produce unusual results when the deviations are <1. – gung - Reinstate Monica Mar 08 '18 at 20:32
  • I mentioned to only do it for values where the standard deviation is greater than 1; according to Chebyshev's inequality, it's not very often that values will be drastically affected. – robert king Mar 08 '18 at 20:34
  • I'm not sure if this technique has been used before, I'd be surprised if it hasn't be used before as it seems fairly intuitive. I'm using it to notify factory workers of violations of nelson rules on products in production lines. It seems to reduce the number of violations reported but seems to still find the critical violations. Nelson rules concerns multiple values in a row being above or below 1 stddev, or smaller numbers of points being 2stddev or 3stddev. @MichaelChernick i'm not familiar with traditional methods, Winsorized looks interesting, may give diff results in black swan events. – robert king Mar 08 '18 at 20:44
  • I didn't mean about the *number* of SDs, exactly. Imagine a case where the SD = .3, & a deviation is .54. Then the deviation is >1SD, but when you divide by the square of the SD, you divide .54/.3^2 = .54/.09 = 6. Ie, the deviation is now larger b/c of the adjustment, rather than having been made smaller. – gung - Reinstate Monica Mar 09 '18 at 01:33
  • 5
    Although this procedure is *ad hoc*, in spirit it is much like an [M-estimator](https://en.wikipedia.org/wiki/M-estimator). One reason for the comments you are getting is that the properties of procedures like this can be analyzed and studied and that, in general, the lack of such study shows the procedure is not well understood and likely is inferior to others. Thus, it is incumbent on anyone proposing a new procedure to characterize its properties sufficiently to enable intelligent, correct application of it. Absent such characterization, readers ought to be reluctant to adopt it. – whuber Mar 09 '18 at 15:13
  • @gung I think I had a typo - by deviation I meant the number of standard deviations. So if the SD is .3, a value must be more than .3 from the mean to be affected. If the value's distance from the mean is .54, then its deviations are .54/.3 = 1.8, so we would divide by 1.8^2 = 3.24, and the value will be 1/3.24 as important as it was previously. – robert king Mar 09 '18 at 22:21
  • @whuber you're right this procedure is ad hoc. you're right readers ought to be reluctant to adopt it. I'm enjoying the comments :) It would be cool if someone did find a problem with my method :) - i'll add a disclaimer to the answer – robert king Mar 09 '18 at 22:25
  • 1
    I admire your attitude (seriously!). Do note, however, that the burden of proof is on you. It's your job to demonstrate the correctness or usefulness of your recommendation (either through citation or a legitimate argument). It's not incumbent on us to perform that analysis. I have pointed to a theory that gives you some hope this procedure has good properties, but it's a general--yet extremely effective--meta-law of statistics that *ad hoc* procedures are inadmissible until proven otherwise (which simply means there is some other procedure that works better). – whuber Mar 09 '18 at 22:30
  • 1
    Thanks for the clarification, that makes a lot more sense. – gung - Reinstate Monica Mar 10 '18 at 01:46
0

There are methods superior to the IQR- or SD-based methods. When outliers are present, the distribution likely has issues with normality already (unless the outliers are evenly distributed at both ends of the distribution). This inflates the SD a lot, making its use less than desirable; however, the SD method has one desirable aspect over the IQR method, namely that the 1.5 times IQR fence is a relatively subjective cutoff. While subjectivity in these matters is unavoidable, it is preferable to reduce it.

A Hampel identifier, on the other hand, uses robust methods to flag outliers. Essentially it's the same as the SD method, but you replace the mean with the median and the SD with the median absolute deviation (MAD). The MAD is just the median distance from the median. Because the MAD is about 0.675 times the SD for normal data, it is rescaled by that constant, and the statistic comes out to 0.675*(X - Median)/MAD, equivalently (X - Median)/(1.483*MAD). The resulting statistic is treated identically to a Z-score. This bypasses the issue of the non-normality that is likely present when you have outliers.
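
A small sketch of such a rule in Python, assuming plain lists (the 1.4826 factor is 1/0.6745 and makes the MAD consistent with the SD for normal data):

def hampel_filtered_mean(data, k=3.0):
    """Mean of the values whose robust z-score |x - median| / (1.4826 * MAD) is at most k."""
    xs = sorted(data)
    n = len(xs)
    med = xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2
    abs_dev = sorted(abs(x - med) for x in xs)
    mad = abs_dev[n // 2] if n % 2 else (abs_dev[n // 2 - 1] + abs_dev[n // 2]) / 2
    kept = [x for x in xs if abs(x - med) <= k * 1.4826 * mad]
    return sum(kept) / len(kept)

print(hampel_filtered_mean([90, 89, 92, 91, 5]))  # the 5 is flagged, leaving 90.5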

As for what to call it: "trimmed mean" is normally reserved for the method of trimming the bottom and top ten percent mentioned by @dsimcha. If the data have been completely cleaned, you may refer to it as the cleaned mean, or just the mean. Just be sure to be clear in your write-up about what you did.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust Statistics. John Wiley & Sons, New York.

NuclAcc
  • 66
  • 5
-4

My statistics textbook refers to this as a Sample Mean as opposed to a Population Mean. Sample implies there was a restriction applied to the full dataset, though no modification (removal) to the dataset was made.

Mike
  • 1
-4

It can be the median. Not always, but sometimes. I have no idea what it is called on other occasions. Hope this helped. (At least a little.)

Samster
  • 1
  • 1
  • 2