I am facing the following problem: I have a training sample and estimate a model on that training sample. My model is simply OLS: $y_t = \alpha + \beta x_t + \varepsilon_t$. The model is estimated on the points $t \in T$, and the training sample contains well-behaved data. When forecasting with this model out of sample, some points may be poorly measured and thus take on extreme values. I would like to prevent my model from producing extreme forecasts when such poorly measured points occur. That is, for points $t \notin T$, I would like the fitted forecast $\hat{y}_t = \alpha + \beta x_t$ to be less sensitive to extreme values of $x_t$. I think the appropriate thing may be to transform the data in some way, perhaps through a $\ln$ transform, or Box-Cox?
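To make the transform idea concrete, here is a rough sketch of the kind of thing I have in mind, using a Box-Cox transform of the predictor. All the data, variable names such as `x_train` / `x_new`, and the specific numbers below are made up purely for illustration:

```python
import numpy as np
from scipy import stats, special

rng = np.random.default_rng(0)

# Well-behaved (positive) training data, made up for this sketch
x_train = rng.lognormal(mean=1.0, sigma=0.3, size=500)
y_train = 2.0 + 0.5 * x_train + rng.normal(scale=0.2, size=500)

# Box-Cox transform of the predictor; lambda estimated on the training set
x_bc, lam = stats.boxcox(x_train)

# OLS of y on the transformed predictor
X = np.column_stack([np.ones_like(x_bc), x_bc])
alpha, beta = np.linalg.lstsq(X, y_train, rcond=None)[0]

# Out of sample: one normal point and one "bad sensor" extreme
x_new = np.array([3.0, 500.0])
x_new_bc = special.boxcox(x_new, lam)   # apply the same lambda as in training
y_hat = alpha + beta * x_new_bc
print(y_hat)  # the extreme input is compressed, so the forecast is less extreme
```

Is this the right direction, or is there a more standard way to handle it?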
Let me illustrate. Imagine you have a sensor that functions normally 99.9% of the time but 0.1% of the time generates a random extreme value that has nothing to do with the true measurement. Unless your training set happens to include such a point, you cannot tailor the model around it. Still, you would like not to generate an extreme prediction out of sample when one of those 0.1% readings occurs.
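A toy simulation of that sensor scenario, together with one candidate fix I have considered (clipping, i.e. winsorizing, the incoming predictor to the range seen in clean data before applying the fitted coefficients). Again, all numbers and names here are invented for illustration, and the coefficients 2.0 and 0.5 stand in for whatever OLS estimated on the clean training sample:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Sensor works normally, except a 0.1% chance of a spurious extreme reading
x = rng.normal(loc=10.0, scale=1.0, size=n)
bad = rng.random(n) < 0.001
x[bad] = rng.uniform(1e3, 1e4, size=bad.sum())

# Naive forecast: apply the OLS coefficients to the raw reading
y_hat_naive = 2.0 + 0.5 * x

# Candidate fix: clip the incoming predictor to the range of clean data
lo, hi = np.percentile(x[~bad], [0.5, 99.5])
y_hat_clipped = 2.0 + 0.5 * np.clip(x, lo, hi)

print(y_hat_naive[bad][:3])    # extreme forecasts from the bad readings
print(y_hat_clipped[bad][:3])  # bounded forecasts
```

I am not sure whether this kind of ad hoc clipping is considered acceptable practice, which is part of what I am asking below.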
I would like to know what the standard techniques are for dealing with this problem. Please provide some references as well if possible.