Crash course in robust mean estimation

Question

I have a bunch (around 1000) of estimates, all of which are supposed to be estimates of long-run elasticity. A little more than half of these are estimated using method A and the rest using method B. Somewhere I read something like "I think method B estimates something very different than method A, because the estimates are much (50-60%) higher". My knowledge of robust statistics is next to nothing, so I only calculated the sample means and medians of both samples... and I immediately saw the difference. The method A sample is very concentrated and the difference between its median and mean is very small, but the method B sample varies wildly.

I concluded that the outliers and measurement errors skew the method B sample, so I threw away about 50 values (about 15%) that were very inconsistent with theory... and suddenly the means of both samples (including their CI) were very similar. The density plots as well.

(To eliminate outliers, I looked at the range of sample A and removed all sample points in B that fell outside it.) Could you tell me where I can learn the basics of robust estimation of means, so that I can judge this situation more rigorously? I would also appreciate some references. I do not need a very deep understanding of the various techniques; I would rather read through a comprehensive survey of the methodology of robust estimation.

I t-tested for significance of the mean difference after removing the outliers and the p-value is 0.0559 (t around 1.9); for the full samples the t statistic was around 4.5. But that is not really the point: the means can be a bit different, but they should not differ by 50-60% as stated above. And I don't think they do.

What's your intended analysis using this data? The practice of removing outliers is of dubious statistical credibility: you can "make data" to give significance or lack of significance at any level by doing that. Are populations A and B which received measurements using methods A and B truly homogenous populations or is it possible that your methods have just given you different populations? — AdamO, Mar 03 '12 at 18:54
There will be no further calculations or analysis to be done with the data. Both of the methods mentioned are consistent, according to recent research, so the populations should be homogenous; but the data is not of great quality and it is clear some of the values in B are there by mistake (the method is error prone), they make absolutely no economic sense. I _know_ the removal is dubious, that is why I am looking for something more rigorous and credible. — Ondrej, Mar 03 '12 at 19:10

D.W. · Accepted Answer · 2012-03-05T18:17:16.777

Are you looking for the theory, or something practical?

If you are looking for books, here are some that I found helpful:

F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel, Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons, 1986.
P.J. Huber, Robust Statistics, John Wiley & Sons, 1981.
P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, John Wiley & Sons, 1987.
R.G. Staudte, S.J. Sheather, Robust Estimation and Testing, John Wiley & Sons, 1990.

If you are looking for practical methods, here are a few robust methods of estimating the mean ("estimators of location" is, I guess, the more principled term):

The median is simple, well-known, and pretty powerful. It has excellent robustness to outliers. The "price" of robustness is about 25%.
The 5%-trimmed average is another possible method. Here you throw away the 5% highest and 5% lowest values, and then take the mean (average) of the result. This is less robust to outliers: as long as no more than 5% of your data points are corrupted, it is good, but if more than 5% are corrupted, it suddenly turns awful (it doesn't degrade gracefully). The "price" of robustness is less than the median, though I don't know what it is exactly.
The Hodges-Lehmann estimator computes the median of the set $\{(x_i+x_j)/2 : 1 \le i \le j \le n\}$ (a set containing $n(n+1)/2$ values), where $x_1,\dots,x_n$ are the observations. This has very good robustness: it can handle corruption of up to about 29% of the data points without totally falling apart. And the "price" of robustness is low: about 2% in standard error (about 5% when measured by variance). It is a plausible alternative to the median.
The midhinge is another estimator that is sometimes used (it is not the same as the interquartile mean, which averages the middle 50% of the observations). It computes the average of the first and third quartiles, and thus is simple to compute. It has very good robustness: it can tolerate corruption of up to 25% of the data points. However, the "price" of robustness is non-trivial: about 25%. As a result, this seems inferior to the median.
There are many other measures that have been proposed, but the ones above seem reasonable.
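As a rough illustration of the estimators listed above, here is a minimal Python sketch using only the standard library. The function names (`trimmed_mean`, `hodges_lehmann`, `midhinge`, the last being the average of the first and third quartiles) and the sample values are made up for this example; the quartiles use the default interpolation rule of `statistics.quantiles`, which is one of several common conventions.

```python
import statistics
from itertools import combinations_with_replacement


def trimmed_mean(xs, proportion=0.05):
    """Mean after dropping `proportion` of the points from each tail."""
    s = sorted(xs)
    k = int(len(s) * proportion)  # number of points cut from each end
    return statistics.fmean(s[k:len(s) - k])


def hodges_lehmann(xs):
    """Median of all pairwise averages (x_i + x_j) / 2 with i <= j."""
    # Builds n(n+1)/2 values: fine for ~1000 points (about 500k pairs).
    pair_means = [(a + b) / 2 for a, b in combinations_with_replacement(xs, 2)]
    return statistics.median(pair_means)


def midhinge(xs):
    """Average of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return (q1 + q3) / 2


# Hypothetical data: nine plausible values plus one gross error.
sample = [0.8, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2, 1.3, 25.0]

print("mean          ", statistics.fmean(sample))
print("median        ", statistics.median(sample))
print("10% trimmed   ", trimmed_mean(sample, 0.10))
print("Hodges-Lehmann", hodges_lehmann(sample))
print("midhinge      ", midhinge(sample))
```

On this made-up sample the mean is dragged up to 3.46 by the single bad value, while the other four estimates all stay near 1.1.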

In short, I would suggest the median or possibly the Hodges-Lehmann estimator.

P.S. Oh, I should explain what I mean by the "price" of robustness. A robust estimator is designed to still work decently well even if some of your data points have been corrupted or are otherwise outliers. But what if you use a robust estimator on a data set that has no outliers and no corruption? Ideally, we'd like the robust estimator to be as efficient at making use of the data as possible. Here we can measure the efficiency by the standard error (intuitively, the typical amount of error in the estimate produced by the estimator). It is known that if your observations come from a Gaussian distribution (iid), and if you know you won't need robustness, then the mean is optimal: it has the smallest possible estimation error. The "price" of robustness, above, is how much the standard error increases if we apply a particular robust estimator to this situation. A price of robustness of 25% for the median means that the size of the typical estimation error with the median will be about 25% larger than the size of the typical estimation error with the mean. Obviously, the lower the "price" is, the better.
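The "price" described above can be checked by simulation. The sketch below is only an illustration under stated assumptions: the helper name `price_of_robustness`, the sample size, the number of replications, and the seed are arbitrary choices made for this example. It draws clean Gaussian samples, applies both the mean and a robust estimator to each, and compares the spread of the two sets of estimates. For the median, the large-sample value is $\sqrt{\pi/2} - 1 \approx 25\%$, so the simulated figure should land close to that.

```python
import random
import statistics


def price_of_robustness(estimator, n=100, reps=2000, seed=0):
    """Relative increase in standard error versus the sample mean,
    estimated by simulation on clean N(0, 1) samples of size n."""
    rng = random.Random(seed)
    mean_estimates, robust_estimates = [], []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean_estimates.append(statistics.fmean(xs))
        robust_estimates.append(estimator(xs))
    # The standard error is the spread of an estimator across replications.
    ratio = statistics.stdev(robust_estimates) / statistics.stdev(mean_estimates)
    return ratio - 1.0


median_price = price_of_robustness(statistics.median)
print(f"median: standard error about {median_price:.0%} larger than the mean's")
```

Any other estimator that maps a list of numbers to a single number can be passed in place of `statistics.median`. The result is a Monte Carlo estimate, so it will wobble by a percentage point or two depending on the seed and the number of replications.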

A very well structured and concise answer, thank you! An overview is what I needed, I will read through the paper suggested by Henrik and should be covered. For long summer night entertainment, I will be sure to check out the books suggested by you and jbowman. — Ondrej, Mar 05 '12 at 17:52
@caracal, you are correct. My characterization of the H-L estimator was incorrect. Thanks for the correction. I've updated my answer accordingly. — D.W., Mar 05 '12 at 18:18
Thanks, @gung! I've edited the answer to use 'standard error' as you suggest. — D.W., Mar 05 '12 at 18:21
Good you define *price of robustness*, since I first took it for *relative efficiencies*, which uses variance not standard error, to be interpretable in terms of sample sizes ... — kjetil b halvorsen, Nov 20 '20 at 12:59

score 7 · Answer 2 · answered Mar 05 '12 at 15:29

If you like something short and easy to digest, then have a look at the following paper from the psychological literature:

Erceg-Hurn, D. M., & Mirosevich, V. M. (2008). Modern robust statistical methods: An easy way to maximize the accuracy and power of your research. American Psychologist, 63(7), 591–601. doi:10.1037/0003-066X.63.7.591

They mainly rely on the books by Rand R. Wilcox (which are admittedly also not too mathematical):

Wilcox, R. R. (2001). Fundamentals of modern statistical methods: Substantially improving power and accuracy. New York; Berlin: Springer.
Wilcox, R. R. (2003). Applying contemporary statistical techniques. Amsterdam; Boston: Academic Press.
Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing. Academic Press.

jbowman · Answer 3 · score 5 · 2012-03-04T03:01:17.430

One book that combines theory with practice pretty well is Robust Statistical Methods with R, by Jurečková and Picek. I also like Robust Statistics, by Maronna et al. Both of these may have more math than you'd care for, however. For a more applied tutorial focused on R, this BelVenTutorial pdf may help.

edited Mar 04 '12 at 03:01

answered Mar 03 '12 at 19:20


Ah, prof. Jurečková — a teacher at our university, what are the odds. I will check both of the books. Though I was looking for a more... brief document (since this problem is very marginal for me), it does not hurt to delve into it a little deeper. Thanks! – Ondrej Mar 04 '12 at 00:10

It's a small world! Well, at least I corrected the spelling by copying from your comment... – jbowman Mar 04 '12 at 03:02
