33

Winsorizing data means replacing the extreme values of a data set with a certain percentile value from each end, while trimming or truncating involves removing those extreme values.

I always see both methods discussed as viable options to lessen the effect of outliers when computing statistics such as the mean or standard deviation, but I have not seen why one might pick one over the other.

Are there any relative advantages or disadvantages to using Winsorizing or Trimming? Are there certain situations where one method would be preferable? Is one used more often in practice or are they basically interchangeable?

gung - Reinstate Monica
Brian
  • 2
    The terminology here is misleading. Trimming means ignoring extreme values, some fraction in each tail. That doesn't imply deletion or dropping of values in the tails, not least because you might, and usually should, include them in other analyses. The term truncation is best reserved for other meanings. See e.g. http://en.wikipedia.org/wiki/Truncation_(statistics) – Nick Cox Mar 19 '14 at 12:05

5 Answers

12

In a different, but related question on trimming that I just stumbled across, one answer had the following helpful insight into why one might use either winsorizing or trimming:

If you take the trimmed distribution, you explicitly state: I am not interested in outliers/ the tails of the distribution. If you believe that the "outliers" are really outliers (i.e., they do not belong to the distribution, but are of "another kind") then do trimming. If you think they belong to the distribution, but you want to have a less skewed distribution, you could think about winsorising.

I'm curious if there is a more definitive approach, but the above logic sounds reasonable.

Brian
4

A good question that is faced very often in all fields! In either case you are technically removing them from the data set.

I know it is common practice when trying to find a trend graphically to use a form of truncation: use the whole data set for plotting purposes, but then exclude the extreme values for the interpretation.

The problem with 'winsorizing' is that the parts you add are self-fulfilling; that is, they originate from the data set itself and so just support it. There are similar problems if you look at cross-validation/classification work in machine learning, when deciding how to use training and test data sets.

I haven't come across a standardised approach in any case; it is always data specific. You can try finding the percentile beyond which your data (the outliers) cause a given percentage of the volatility/standard deviation, and then strike a balance between reducing that volatility and retaining as much of the data as possible.
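As a rough sketch of that balancing exercise, one could tabulate how the standard deviation falls as more points are ignored in each tail. The function name `sd_by_trim` and the sample values are made up for illustration; they are not from any library or from real data.

```python
from statistics import stdev


def sd_by_trim(values, max_k):
    """For k = 0..max_k, ignore the k smallest and k largest values and
    report (k, fraction of data kept, sample standard deviation)."""
    ordered = sorted(values)
    n = len(ordered)
    rows = []
    for k in range(max_k + 1):
        kept = ordered[k:n - k]
        if len(kept) < 2:  # stdev needs at least two points
            break
        rows.append((k, len(kept) / n, stdev(kept)))
    return rows


# Hypothetical sample with one very large value.
sample = [2, 4, 5, 6, 7, 8, 9, 11, 12, 95]

for k, kept_fraction, sd in sd_by_trim(sample, 2):
    print(f"k={k}  kept={kept_fraction:.0%}  sd={sd:.2f}")
```

Where to stop is still a judgment call; the table only makes the trade-off between volatility and retained data visible.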

n1k31t4
  • 7
    As in my comment above, "removing them from the data set" is too strong here. Trimming or Winsorizing just means what it does, ignoring or replacing as may be, for a certain calculation. You are not _obliged_ to remove the tail values from the dataset, as if you were throwing out rotten fruit. For example, faced with possible outliers, you might do an analysis of the data as they come and an analysis based on trimming and see what difference it makes. – Nick Cox Mar 19 '14 at 12:13
3

Clearly, the respective merits depend on the data under analysis, and although they depend in non-trivial ways on what actually causes data to be distributed as it is, we can at least consider two extreme cases.

  1. Data is virtually error-free and just has legitimate outliers, but you don't want your results to be severely affected by them. For instance: in a distribution of wealth, there are horribly rich and horribly indebted people who would bear an excessive weight in your estimates. Now, you don't necessarily want to ignore these people; you just want to ignore that they are so rich, or so indebted. By winsorizing, you treat them as "reasonably rich" or "reasonably indebted". (Notice that in this specific example, if you were only looking at positive wealth, taking a logarithm might be preferable.)

  2. The underlying distribution is nice, possibly normal, but there are (few but relevant) errors in the data and you know it is only such errors that cause the outliers. For instance: in a distribution of reported salaries, a few survey participants might have mistyped, or reported in the wrong currency, their own salary, resulting in unreasonable amounts. By trimming, you exclude outliers because they really don't provide useful information, they are just noise (notice you will have some noise left in the distribution, but at least you remove the noise that would disproportionately distort your analysis).
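The two cases can be sketched in a few lines of Python. Everything here is illustrative: the helper names `winsorize` and `trim`, the 10% tail fraction, and both small data sets are assumptions, not anything from a real survey.

```python
from statistics import mean


def tail_count(n, fraction):
    """Number of values treated as extreme in each tail (rounded down)."""
    k = int(n * fraction)
    if 2 * k >= n:
        raise ValueError("fraction leaves no central values")
    return k


def winsorize(values, fraction):
    """Replace the extreme values in each tail by the nearest retained value."""
    ordered = sorted(values)
    k = tail_count(len(ordered), fraction)
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    return [min(max(v, low), high) for v in ordered]


def trim(values, fraction):
    """Ignore the extreme values in each tail for this calculation."""
    ordered = sorted(values)
    k = tail_count(len(ordered), fraction)
    return ordered[k:len(ordered) - k]


# Case 1: hypothetical wealth (thousands), legitimate but extreme values.
wealth = [-900, 5, 12, 20, 35, 50, 80, 120, 300, 25000]
print(mean(wealth), mean(winsorize(wealth, 0.1)))

# Case 2: hypothetical salaries (thousands), last entry assumed mistyped.
salaries = [31, 34, 36, 38, 40, 41, 43, 45, 47, 4200]
print(mean(salaries), mean(trim(salaries, 0.1)))
```

Note that the winsorized sample still has all ten people in it, only with capped values, while the trimmed calculation uses eight.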

In real data, however, outliers are often a mixture of data errors and legitimate extreme values, which is not obvious to interpret.

The recommendation to always parallel your winsorized/trimmed results with the full results is always valid, but for two slightly different reasons. In the first case, to warn the reader that you do not claim you are talking about the actual distribution: rather, you study a modified distribution which de-emphasizes extreme values. In the second case, because you claim you are talking about the actual distribution, but you must warn the reader you more or less arbitrarily decided what in the data was actually noise, not information.

On a more subjective note, trimmed results (and the difference with full results) are often easier to describe correctly, and to grasp intuitively, than winsorized results.

-1

This is a good question, and one I have faced. Suppose you have a large dataset, or more accurately a widely varying one, in which a minority of values span a wide scale (but must nevertheless be shown) while the majority lie within a narrow band. If the data are plotted as is, the detail where most of the data lie is lost. When normalizing or standardizing does not show adequate differentiation (at least visually), or when raw data are required instead, truncating or winsorizing the extreme values helps produce a better visualization.

guest
  • 1
  • It's a good question, but you don't answer it. You just say that truncating or Winsorizing can help visualization. – Nick Cox Jan 23 '16 at 17:33
-2

One advantage of Winsorizing is that the calculation may be more efficient. To calculate a true truncated mean, you need to sort all of the data elements, which is typically $O(n \log n)$. However, the 25th and 75th percentiles alone can be found efficiently with the quickselect algorithm, which is typically $O(n)$. Once you know these end points, you can loop over the data once more, replace values below the 25th percentile with the 25th percentile value and values above the 75th percentile with the 75th percentile value, and average. This is identical to the Winsorized mean. But looping over the data and averaging only the values between the 25th and 75th percentile values is NOT identical to the truncated mean, because the 25th or 75th percentile value may not be unique. Consider the data sequence $(1,2,3,4,4)$. The Winsorized mean is $(2+2+3+4+4)/5$. The correct truncated mean should be $(2+3+4)/3$. The "quick-select" optimized truncated mean will be $(2+3+4+4)/4$.
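The three calculations on $(1,2,3,4,4)$ can be checked with a short Python sketch. The function names are made up for illustration, and the end points 2 and 4 are passed in directly as the assumed 25th and 75th percentile values rather than computed by a selection algorithm.

```python
def winsorized_mean(values, lo, hi):
    """Clip every value into [lo, hi], then average all n values."""
    clipped = [min(max(v, lo), hi) for v in values]
    return sum(clipped) / len(clipped)


def range_filtered_mean(values, lo, hi):
    """Average only the values lying in [lo, hi]; no sorting needed,
    but ties at the end points are all kept."""
    kept = [v for v in values if lo <= v <= hi]
    return sum(kept) / len(kept)


def truncated_mean(values, k):
    """Sort, ignore exactly k values in each tail, and average the rest."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [1, 2, 3, 4, 4]
lo, hi = 2, 4  # assumed 25th and 75th percentile values

print(winsorized_mean(data, lo, hi))      # (2+2+3+4+4)/5 = 3.0
print(truncated_mean(data, 1))            # (2+3+4)/3     = 3.0
print(range_filtered_mean(data, lo, hi))  # (2+3+4+4)/4   = 3.25
```

In this small example the Winsorized and truncated means happen to coincide; the point is that the range-filtered shortcut keeps both 4s and so differs from the truncated mean.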

Nick Cox
Mark Lakata
  • 1
    It is not the case that you need to sort all the data to compute a median (as true a median as you like), nor is it true that it's an $O(n\log n)$ calculation to find it. There are algorithms for finding the median that are $O(n)$ (worst case). [Further, if quick select could find the 25th and 75th percentiles in O(n) as you say, why would quick select be unable to find the 50th percentile in the same order?] – Glen_b Sep 22 '14 at 23:18
  • You are correct. I mistyped my original post. Sometimes the typing fingers and brain are not in sync. I meant to say to correctly calculate a true *truncated mean*, you need to sort all of the data elements. I believe this is still true. I've updated my answer. – Mark Lakata Sep 23 '14 at 22:14
  • 2
    This seems to imply that Winsorizing means Winsorizing 25% in each tail. You can Winsorize as much or as little as seems appropriate. – Nick Cox Sep 23 '14 at 22:40