Dropping outliers based on "2.5 times the RMSE"

Question

In Kahneman and Deaton (2010)$^\dagger$, the authors write the following:

> This regression explains 37% of the variance, with a root mean square error (RMSE) of 0.67852. To eliminate outliers and implausible income reports, we dropped observations in which the absolute value of the difference between log income and its prediction exceeded 2.5 times the RMSE.

Is this common practice? What is the intuition behind doing so? It seems somewhat strange to define an outlier based upon a model which may not be well-specified in the first place. Shouldn't the determination of outliers be based on some theoretical grounds for what constitutes a plausible value, rather than how well your model predicts the real values?

$\dagger$: Daniel Kahneman, Angus Deaton (2010): High income improves evaluation of life but not emotional well-being. Proceedings of the National Academy of Sciences, Sep 2010, 107 (38) 16489-16493; DOI: 10.1073/pnas.1011492107

When you give a quote from a paper, always give a reference that includes the *page number*. — Ben, Jul 12 '19 at 01:26
I can't say whether this is 'common practice', but I hope not. Automated removals of 'outliers' is fundamentally a bad idea. Maybe your model or removal criterion is not good, maybe there's something new going on (downturn beginning, fresh possibilities awakening) that you shouldn't ignore. // It's different if you can track a suspicious value to data entry error or equipment failure, or if the value is simply off-the-charts absurd (16'2" tall man, guy w/ 61 billable hours last Tuesday, 25min flight SFO-ORD). But not because it doesn't fit a model. I know a startup that went broke that way. — BruceET, Jul 12 '19 at 01:38
The statistical validity of this approach is reflected by the absurd number of decimals they report for the RMSE. — Frans Rodenburg, Jul 12 '19 at 03:55
This feels like a crude / heroic assumption solution to a question I asked a few months ago: https://stats.stackexchange.com/questions/390051/using-regression-weights-when-y-might-be-measured-with-bias — Adrian, Jul 13 '19 at 00:37

Ben · Accepted Answer · 2019-07-13T00:42:19.033

The reason for dropping this data is stated right there in the quote: namely, to "eliminate outliers and implausible income reports". The fact that they refer to both of these things in conjunction means that they are conceding that at least some of their outliers are not implausible values, and in any case, they give no argument for why values with a high residual should be considered "implausible" income values. By doing this, they are effectively removing data points because the residuals are higher than what is expected under their regression model. As I have stated in other answers here, this is tantamount to requiring reality to conform to your model assumptions, and ignoring the parts of reality that do not comply with those assumptions.

Whether or not this is a common practice, it is a terrible practice. It occurs because the outlying data points are hard to deal with, and the analyst is unwilling to model them properly (e.g., by using a model that allows higher kurtosis in the error terms), so they just remove parts of reality that don't conform to their ability to undertake statistical modelling. This practice is statistically undesirable and it leads to inferences that systematically underestimate variance and kurtosis in the error terms. The authors of this paper report that they dropped 3.22% of their data due to the removal of these outliers (p. 16490). Since most of these data points would have been very high incomes, this casts substantial doubt on their ability to make robust conclusions about the effect of high incomes (which is the goal of their paper).
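To illustrate the point about underestimated variance and kurtosis, here is a minimal simulation sketch. It does not use the authors' data: the sample size, the single predictor, the coefficients, and the choice of Student-t errors with 4 degrees of freedom are all assumptions made purely for illustration, and the helper names (`fit_residuals`, `rmse`, `kurtosis`) are hypothetical. The RMSE here is computed without a degrees-of-freedom correction, which is negligible at this sample size.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 100_000

# Hypothetical data (an assumption, not the paper's data):
# one predictor and heavy-tailed errors from a Student-t with 4 d.o.f.
x = rng.normal(size=n)
errors = rng.standard_t(df=4, size=n)
y = 1.0 + 0.5 * x + errors


def fit_residuals(x, y):
    """Fit y = a + b*x by ordinary least squares and return the residuals."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def rmse(r):
    """Root mean square of the residuals (no degrees-of-freedom correction)."""
    return float(np.sqrt(np.mean(r ** 2)))


def kurtosis(r):
    """Sample kurtosis; a normal distribution has kurtosis 3."""
    c = r - r.mean()
    return float(np.mean(c ** 4) / np.mean(c ** 2) ** 2)


# Step 1: fit on all observations.
resid_full = fit_residuals(x, y)
rmse_full = rmse(resid_full)
kurt_full = kurtosis(resid_full)

# Step 2: apply the "2.5 times the RMSE" rule and refit on what is left.
keep = np.abs(resid_full) <= 2.5 * rmse_full
resid_trim = fit_residuals(x[keep], y[keep])
rmse_trim = rmse(resid_trim)
kurt_trim = kurtosis(resid_trim)

frac_dropped = 1.0 - float(keep.mean())

# Share of observations the rule would drop if the errors were exactly normal:
# P(|Z| > 2.5), about 0.0124.
normal_tail = math.erfc(2.5 / math.sqrt(2.0))

print(f"fraction dropped:      {frac_dropped:.4f}")
print(f"normal-theory tail:    {normal_tail:.4f}")
print(f"RMSE full / trimmed:   {rmse_full:.3f} / {rmse_trim:.3f}")
print(f"kurtosis full/trimmed: {kurt_full:.2f} / {kurt_trim:.2f}")
```

In this setup the trimmed fit reports a smaller RMSE and a much smaller residual kurtosis than the data actually have, simply because the tails were cut off. Note also that exactly normal errors would put only about 1.24% of observations beyond 2.5 standard deviations, so the 3.22% dropped in the paper is itself a hint that the residuals are heavier-tailed than a normal model would suggest.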

edited Jul 13 '19 at 00:42

answered Jul 12 '19 at 01:41

Ben

How dare you criticize *the* Daniel Kahneman! Jokes aside, those are very good points +1. – Tim Jul 12 '19 at 06:57
Both authors are recipients of the Nobel Prize. – Nick Cox Jul 12 '19 at 07:09
Kahneman is a very fine psychologist, whose books I have generally enjoyed and found helpful. They could each have fifty Nobel prizes --- it wouldn't change the fact that mass removal of "outliers" is a terrible statistical practice. – Ben Jul 12 '19 at 08:19
Naturally I agree with you. I didn't think that needed saying. – Nick Cox Jul 12 '19 at 08:24
@NickCox You mean the so called ["Nobel Memorial Prize"](https://en.wikipedia.org/wiki/Nobel_Memorial_Prize_in_Economic_Sciences): as I'm sure you know it wasn't established by Nobel and has nothing to do with him really. The official name is apparently "The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel". – amoeba Jul 12 '19 at 09:27
You're sure I know that and you are indeed correct. The always authoritative EJMR once carried this posting about me "No, he will never win the Nobel", meaning that prize. – Nick Cox Jul 12 '19 at 10:20
