
I am doing research on Winsorization (and trimming), which has been broadly applied in many fields, but I think many researchers do not apply it in a "rigorous" way, or, maybe even worse, misuse it. So I am wondering whether there is a well-defined, formalized way to apply Winsorization (or trimming). What appears in many papers is that the researchers simply apply Winsorization whenever there are extreme values in their data set. They do not:

  • Justify the mechanism behind the extreme values (are they legitimate observations, or draws from some contaminating distribution?).
  • Follow the framework of robust statistics (make assumptions about the distribution, define the estimator, i.e. the Winsorized estimator, and then do inference).

In my opinion, when people talk about "Winsorization", there are two possible meanings:

  • An action that changes (Winsorizes) the extreme values, after which a classical statistical inference procedure is followed.
  • An estimator (the Winsorized-mean estimator), defined as a functional of the empirical CDF, $\hat{\theta}=T(\hat{F}_n)$, which follows a robust statistical procedure.

For the second, the data do not change; we only change the estimator. But for the first, the data are changed and then treated as real observations. That amounts to data manipulation, which should be abandoned.

In this sense, any study that follows the first procedure should be regarded as a misuse and its results taken with caution. Can I understand it this way?

Master Shi
  • I vote for Winsorizing (or Winsorising) over Winsori*ation any day and all days. This was the original term and it doesn't need any more syllables. – Nick Cox Oct 29 '19 at 12:49
  • Disapprove or approve, and I disapprove, Winsorizing the data before modelling seems routine in certain areas of finance, where -- if I understand correctly -- the implication is that very extreme changes in say stock prices or traded volumes are typically attributable to bizarre events (e.g. sudden departure of CEO, very bad news about products), which are real but not relevant to the main focus of most projects. In my own areas of interest, if an extreme value is not demonstrably wrong, then it is real and should be in the dataset. I doubt that disapproval from a distance will affect this. – Nick Cox Oct 29 '19 at 12:54

1 Answer


When winsorizing the data, an $\alpha\%$ winsorization (with $\alpha\in[0\%, 100\%]$) is defined as replacing the smallest $(100\%-\alpha)/2$ of the values with the value just above them, and the largest $(100\%-\alpha)/2$ of the values with the value just below them.

For example, a 90% winsorization would see all data below the 5th percentile set to the 5th percentile, and all data above the 95th percentile set to the 95th percentile.

You then apply regular statistical methods to such data, e.g. compute the arithmetic mean.
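If it helps, here is a minimal sketch in Python, with made-up numbers (the `winsorized_mean` helper is illustrative, not a library function), showing that the two routes give the same point estimate:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical data with one extreme value at each end.
x = np.array([-50.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 500.0])

# View 1: winsorize the data first (an 80% winsorization: clip the
# bottom and top 10% of values), then apply a classical method, the mean.
xw = winsorize(x, limits=(0.1, 0.1))
print(xw.mean())  # 5.5

# View 2: a winsorized-mean estimator applied directly to the raw data.
def winsorized_mean(a, frac=0.1):
    """Replace the k most extreme values on each side with their
    nearest retained neighbour, then take the ordinary mean."""
    a = np.sort(a)          # np.sort returns a copy, so x is untouched
    k = int(frac * len(a))
    if k > 0:
        a[:k] = a[k]
        a[-k:] = a[-k - 1]
    return a.mean()

print(winsorized_mean(x))  # 5.5 -- identical, h(x) = g(f(x))
```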

With the winsorized mean, you replace the smallest and largest values as above and then compute the arithmetic mean. So the two approaches are exactly the same, since the winsorized mean is defined in terms of winsorizing the data and then taking the regular mean. In the first case, you apply a function $f$ to the data and then pass the output through $g$; in the second, you apply $h(x) = g(f(x))$, so they are mathematically equivalent.

You are right that we can often choose between "mechanistic" approaches to dealing with outliers (like winsorizing, dropping, or downweighting them) and end-to-end models that account for such data. However, this is not really the case for the winsorized mean, for the reasons outlined above. An example of such an end-to-end model would be a regression assuming a long-tailed distribution for the likelihood function, where the model assumes that the data were generated from an "outlier"-prone distribution.
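As a rough sketch of such an end-to-end model (everything here is hypothetical: simulated data, a Student-t likelihood with degrees of freedom fixed at 2, fitted by maximum likelihood):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)

# Simulated data: a straight line plus heavy-tailed (Student-t) noise,
# so occasional extreme residuals are expected rather than anomalous.
x = np.linspace(0.0, 10.0, 100)
y = 2.0 + 0.5 * x + stats.t.rvs(df=2, size=x.size, random_state=rng)

def neg_log_lik(params):
    """Negative log-likelihood of a regression with t-distributed errors."""
    intercept, slope, log_scale = params
    resid = y - (intercept + slope * x)
    return -stats.t.logpdf(resid, df=2, scale=np.exp(log_scale)).sum()

# Maximum likelihood: the heavy-tailed likelihood tolerates outliers
# without any explicit winsorizing, trimming, or dropping.
fit = optimize.minimize(neg_log_lik, x0=np.zeros(3))
print(fit.x[:2])  # estimated intercept and slope, near (2.0, 0.5)
```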

Notice also that in many cases in statistics, estimators do not depend only on the data, e.g. when using regularization, or priors in the Bayesian approach. Even with very basic statistical tools, like choosing between the empirical mean and the median to measure central tendency, you may not decide to ignore the extreme data points, but you do choose to pay much less attention to them. What I am trying to say is that the fact that you did not explicitly transform the data does not mean that you are "letting the data speak for themselves", or that the approach is more "pure" in any sense.
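As a tiny illustration of that last point, with hypothetical numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # one extreme value

print(np.mean(x))    # 202.0 -- tracks the magnitude of the extreme point
print(np.median(x))  # 3.0   -- the same whether the last value is 10 or 1e9
```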

Tim
  • I agree with you that they are the same in terms of estimation of the mean. But what I want to highlight is that the distributions of the estimators are different. One is the distribution of the arithmetic-average estimator; the other is the distribution of the winsorized-mean estimator. So, even though the values of the mean estimates are the same mathematically, the "standard errors" are not. Am I right? – Master Shi Feb 18 '19 at 02:37
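One way to see the distinction raised in this comment is to compare a naive standard error computed from the winsorized values (as if they were ordinary i.i.d. observations) with a bootstrap standard error of the winsorized-mean estimator that re-winsorizes inside each resample. A rough sketch with simulated data (the `winsorized_mean` helper is the same illustrative function as above):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_t(df=2, size=200)  # hypothetical heavy-tailed sample

def winsorized_mean(a, frac=0.05):
    a = np.sort(a)
    k = int(frac * len(a))
    if k > 0:
        a[:k] = a[k]
        a[-k:] = a[-k - 1]
    return a.mean()

# Naive SE: winsorize once and treat the result as ordinary i.i.d. data.
xw = np.sort(x)
k = int(0.05 * len(xw))
xw[:k] = xw[k]
xw[-k:] = xw[-k - 1]
naive_se = xw.std(ddof=1) / np.sqrt(len(xw))

# Bootstrap SE of the winsorized-mean estimator: resample the RAW data
# and re-winsorize inside every replicate.
boot = [winsorized_mean(rng.choice(x, size=len(x), replace=True))
        for _ in range(2000)]
boot_se = np.std(boot, ddof=1)

# The naive SE is typically the smaller of the two, because winsorizing
# shrinks the apparent spread before the procedure's variability is assessed.
print(naive_se, boot_se)
```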