How can I identify and remove outliers in R?

Question

I am performing regression analysis on the prices of products that we have purchased, based on size and other attributes.

However, there are often buys made in odd circumstances that factor into the price, and these circumstances are not (and cannot be) addressed directly in the features of the analysis.

Each time I run a regression, I manually check the 20 observations with the largest errors. More than 90% of the time they are odd buys like those mentioned above, and for my purposes they can be ignored completely.

I have been looking into Cook's distance to remove these, but I'm not sure how best to set the threshold, or whether there is a better method to use.

Could you explain how *any* statistical procedure could possibly hope to succeed in automatically identifying observations that "cannot be addressed directly in the features"? — whuber, Jun 28 '16 at 15:34
@whuber As you suggested, upon incorporating the effect of known features, the remaining series of tentative errors (adjusted Y's) can be examined for auto-regressive structure (if the data are longitudinal). Upon adjusting for any significant time/space dependency, the result can be examined for anomalies (pulses), whether one-time, seasonal, or a set of contiguous pulses suggestive of a group shift or trend. – IrishStat, Jun 28 '16 at 17:21
There are many difficulties confronting a belief that some values are rogue outliers that don't belong while the other values are fine. (Strong but sensitive political analogies are there for those so inclined.) Even the simple device of using (say) a logarithmic link function for prices that can't be negative is likely to make some values seem less outrageous. For reminders of other strategies that can be used here, see other threads tagged outliers, e.g. http://stats.stackexchange.com/questions/78063/replacing-outliers-with-mean — Nick Cox, Jun 28 '16 at 17:39

Matthew Gunn · Answer 1 · 2016-09-02T17:05:13.197

"However, there are often buys made in odd circumstances that factor into the price, and these circumstances are not (and cannot be) addressed directly in the features of the analysis."

Isn't that what error terms in a regression are supposed to capture: variation in the outcome variable that isn't explained by the features of your model?

If your question is how to deal with outliers in general, under the assumption that extreme observations are probably bad data, some standard approaches are:

Trimming the data, e.g. ignoring the most extreme 1% of observations.
Winsorizing the data. Replace observations above or below some cutoff with the value of the cutoff. (This isn't quite as extreme as trimming the data, which deletes extreme observations entirely.)
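
A minimal sketch of both approaches in base R, on a single simulated price vector. The data, the variable names, and the 1%/99% cutoffs are illustrative assumptions, not part of the original question:

```r
# Simulated prices with a few extreme "odd buys" (illustrative data only).
set.seed(1)
price <- c(rnorm(200, mean = 100, sd = 10), 400, 450, 5)

# Cutoffs at the 1st and 99th percentiles (the 1% level is an arbitrary choice).
cut <- quantile(price, probs = c(0.01, 0.99))

# Trimming: leave values outside the cutoffs out of this calculation.
trimmed <- price[price >= cut[1] & price <= cut[2]]

# Winsorizing: pull values outside the cutoffs back to the cutoffs.
winsorized <- pmin(pmax(price, cut[1]), cut[2])

# Compare the mean under each treatment.
c(raw = mean(price), trimmed = mean(trimmed), winsorized = mean(winsorized))
```

Note that this handles one variable at a time. In a regression with several predictors it is much less obvious which observations should be treated this way.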

Some fancier approaches to outliers (ignore if this is at all confusing):

You can do things like ellipsoidal peeling: find the minimum-volume ellipsoid that encloses your data, then remove the observations on its surface.
Estimate the regression with the Huber loss function or something else less sensitive to outliers than OLS, or use a maximum likelihood estimator with t-distributed rather than normally distributed errors.
Quantile regression.
You could adopt some Bayesian view as to whether an observation is bad data.
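
As one illustration of the Huber option, here is a sketch using `rlm` from the MASS package (a recommended package that ships with standard R installations). The simulated data, the size of the contamination, and the variable names are assumptions made for the example:

```r
library(MASS)  # provides rlm() and psi.huber

set.seed(2)
n <- 100
size <- runif(n, 1, 10)
price <- 5 + 3 * size + rnorm(n)

# Contaminate the five largest purchases with wildly inflated prices.
idx <- order(size, decreasing = TRUE)[1:5]
price[idx] <- price[idx] + 60

ols <- lm(price ~ size)
hub <- rlm(price ~ size, psi = psi.huber, maxit = 50)  # Huber M-estimator

# Coefficients side by side; the true slope in this simulation is 3.
rbind(ols = coef(ols), huber = coef(hub))

# Final weights from the iterative fit: values well below 1 were down-weighted.
head(sort(hub$w), 5)
```

The weights in `hub$w` can also serve as a screening list: observations with very small weights are the ones the robust fit regards as unusual.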

Beware the problems of mishandling outliers...

In many cases, such as returns for financial securities, removing or ignoring outliers can be hugely problematic. Oftentimes, all the action is in the outliers! Major stock market crashes, company bankruptcies, and the like are hugely important.

For situations involving safety (e.g. car crashes), ignoring bad outliers can be even worse! You don't want to winsorize observations such that observations where people die get replaced with observations where people are mildly injured. That would perhaps be criminal negligence.

Trimming need not, and indeed should not, entail deleting (dropping) observations with extreme values. Trimming usually just means ignoring those extremes in particular calculations. Physical deletion would compromise other calculations, e.g. the trimmed mean for $y$ might well ignore quite different observations from those ignored in the trimmed mean for $x$. In the case of regression, it's far from obvious what should be trimmed, especially for one response $y$ and several predictors $X$. – Nick Cox, Jun 28 '16 at 17:26

score 1 · Answer 2 · answered Jun 28 '16 at 16:01

I suggest following the steps described in Zuur et al. (2010), "A protocol for data exploration to avoid common statistical problems". This will help you identify outliers.

Mud Warrior

The link is broken. If you put in a reference, like an APA reference, then I might be able to find it more easily somewhere else. – EngrStudent Dec 09 '20 at 12:48

score 1 · Answer 3 · answered Jun 28 '16 at 17:14

From your description, it sounds as though you are (a) using some automatic process to identify potential outliers and (b) examining them one by one using your subject-matter knowledge and eliminating those that seem to come from some other process. This seems a sensible procedure, but the crucial step is (b). Whatever procedure you use in step (a), you really cannot give up step (b); otherwise you are letting the machine pick your model for you when you have information you can use yourself. I realise that this is not going to be much help if you have thousands of potential outliers to screen, but human intelligence is the way to go here.
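
The two-step workflow might be sketched as follows in base R. The data frame `buys`, its columns, and the simulated "odd buys" are assumptions for illustration, and the cutoff of 3 on the standardized residual is only a stand-in for the human judgment in step (b):

```r
set.seed(3)
n <- 300
buys <- data.frame(size = runif(n, 1, 10))
buys$price <- 5 + 3 * buys$size + rnorm(n)
odd <- sample(n, 10)                       # hypothetical "odd buys"
buys$price[odd] <- buys$price[odd] + 25

fit <- lm(price ~ size, data = buys)

# Step (a): compute standard diagnostics for every observation.
buys$std_resid <- rstandard(fit)           # standardized residuals
buys$cooks_d   <- cooks.distance(fit)      # Cook's distance

# Candidates for step (b): the 20 largest absolute residuals.
review <- buys[order(abs(buys$std_resid), decreasing = TRUE)[1:20], ]
review

# Step (b) is manual. Here a crude cutoff stands in for that review.
confirmed <- rownames(review)[abs(review$std_resid) > 3]
refit <- lm(price ~ size, data = buys[!rownames(buys) %in% confirmed, ])
```

Cook's distance is listed alongside the residuals so the reviewer can see which candidates actually move the fit. A cutoff of 4/n is a commonly quoted rule of thumb for it, but it is a convention rather than a test, so it is better used to rank candidates than to delete them automatically.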

score 1 · Answer 4 · answered Jun 28 '16 at 17:51

An alternative to removing the outliers would be to use ridge regression.

Ridge regression penalizes large coefficients, which can reduce the influence that outliers have on the fitted regression coefficients. It has a tuning parameter, lambda, that must be chosen with a method such as cross-validation to get acceptable results.
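
A small sketch of tuning lambda, using `lm.ridge` from the MASS package. It selects lambda by generalized cross-validation (GCV), which `lm.ridge` reports directly, rather than by k-fold cross-validation. The simulated data and the lambda grid are assumptions for the example:

```r
library(MASS)  # provides lm.ridge()

set.seed(4)
n  <- 80
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)    # strongly correlated with x1
y  <- 1 + 2 * x1 + rnorm(n)
d  <- data.frame(y, x1, x2)

# Fit the ridge path over a grid of penalty values (grid chosen arbitrarily).
lambdas <- seq(0, 20, by = 0.1)
fits <- lm.ridge(y ~ x1 + x2, data = d, lambda = lambdas)

# Pick the lambda with the smallest GCV score.
best <- which.min(fits$GCV)
lambdas[best]
coef(fits)[best, ]   # intercept and slopes at the chosen lambda
```

The penalty shrinks the coefficients but does not flag or down-weight individual observations, so it does not replace inspecting large residuals.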

score 0 · Answer 5 · answered Mar 01 '20 at 14:53

A common way to remove outliers is the peel-off method (which I learnt from a friend). It goes like this: take your set of data points and construct its convex hull; remove the boundary points from the set and construct the convex hull of the points that remain; then measure how much shrinkage removing those points produced.

Based on your purpose, you set a shrinkage threshold and keep peeling until you reach it, at which point you stop.

This is quite easy to do in R, since you have several algorithms available for constructing convex hulls and identifying points on the boundary.
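
A sketch of the peel-off idea for two-dimensional data, using `chull` from base R. Measuring shrinkage by hull area is one possible choice, and `poly_area`, `peel_hull`, and `keep_frac` are illustrative names rather than functions from any package:

```r
# Shoelace formula: area of a polygon whose vertices are given in order.
poly_area <- function(p) {
  x <- p[, 1]; y <- p[, 2]
  j <- c(2:nrow(p), 1)
  abs(sum(x * y[j] - x[j] * y)) / 2
}

# Peel hull layers until the hull area drops to `keep_frac` of the original.
peel_hull <- function(pts, keep_frac = 0.5) {
  hull_area <- function(idx) {
    h <- idx[chull(pts[idx, , drop = FALSE])]    # hull vertices, in order
    poly_area(pts[h, , drop = FALSE])
  }
  keep  <- seq_len(nrow(pts))
  area0 <- hull_area(keep)
  while (hull_area(keep) / area0 > keep_frac) {
    hull <- chull(pts[keep, , drop = FALSE])     # indices into the current subset
    if (length(keep) - length(hull) < 3) break   # too few points left to peel
    keep <- keep[-hull]
  }
  keep  # row indices of the points that survive peeling
}

set.seed(5)
pts <- rbind(cbind(rnorm(200), rnorm(200)), c(8, 8), c(-9, 7))  # two far-out points
inner   <- peel_hull(pts, keep_frac = 0.5)
removed <- setdiff(seq_len(nrow(pts)), inner)
```

Each pass removes every vertex of the current hull, so ordinary points on the boundary are peeled along with the extreme ones. The threshold therefore controls how much good data is lost as well as how many outliers are dropped.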

Methods based on this concept indeed can be good for identifying *multivariate* outliers, but this approach usually isn't appropriate for *regression* models. The reason lies in the asymmetric roles played by the explanatory and response variables. — whuber, Mar 01 '20 at 18:41
