
I'm trying to remove what might be considered "unreasonable" data by evaluating what I'll call the percent error: the square root of the variance expressed as a percentage of the mean. Here's the setup:

Let's say I have three bids on a contract. The contractors' total bids are all relatively close. But the itemized breakdown of the bids can have extremely high variances in them.

For example:

  # Total Bid  Item 1  Item 2  Item 3  Item 4  Item 5
  - ---------  ------  ------  ------  ------  ------
  1 827,558    1,026   27.7    800     1,000   1,998
  2 667,118    950     25      80      3,000   23
  3 720,909    1,100   25      25      1,100   22.4
--- ---------  ------  ------  ------  ------  ------  
err 9.03       5.97    4.91    117     54.1    136.78

The "err" is the percentage error between the mean and the square root of the variance of each group, calculated as:

((mean - var^(1/2)) / mean) * 100
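
For reference, this row can be reproduced in R with a few lines (a minimal sketch using the example data above; note it uses the population variance, dividing by n rather than n - 1):

    # Bids from the example above: one row per bidder, one column per group
    bids <- data.frame(
      total = c(827558, 667118, 720909),
      item1 = c(1026, 950, 1100),
      item2 = c(27.7, 25, 25),
      item3 = c(800, 80, 25),
      item4 = c(1000, 3000, 1100),
      item5 = c(1998, 23, 22.4)
    )

    # Population variance (divide by n rather than n - 1)
    pop_var <- function(x) mean((x - mean(x))^2)

    # "err": square root of the variance as a percentage of the mean
    err <- sapply(bids, function(x) 100 * sqrt(pop_var(x)) / mean(x))
    round(err, 2)  # approximately reproduces the err row above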

This metric does a great job of representing the problem I think I need to address. For example, the % errors of Items 1 and 2 show that the bidders bid pretty consistently on those items - even more consistently than on the overall bid totals (error 9.03%).

By contrast, Items 3 - 5 show a higher degree of inconsistency, ranging from 54% to over 136%.

Here's what I know about the data a priori:

The high bids on Item 3 and Item 5 are garbage. By that I mean there's no real way to have anticipated those bids; it's just the bidder playing games with how they itemize their bids (really high on one item, really low on another) to mitigate extra costs if they're awarded the contract. In both Items 3 and 5, the lower bids are far closer to the value of the work.

Item 4 has a more ambiguous distribution. It could be that the lower bids represent the value of the work more accurately (and they likely do), but it may also be that the value is higher here than it seems. I'd be reluctant to throw out the high bid and might instead consider a weighted average as the real value of the work.

I should also point out that I'm using this data to train a neural network. Ideally, the model's prediction error would be 15% or less.

So, in order to treat this as conservatively as possible, keeping outliers that might reasonably contribute to the model while throwing out ones that are obviously useless, I've considered a couple of approaches:

  1. Reject all bids for an item if the item's % error exceeds a set threshold.

  2. Reject only the most variant bids when % error exceeds the threshold.

It seems to me the best approach might be #1, using a threshold that scales with the desired error of the model...
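
For what it's worth, approach #1 would look roughly like this (a rough sketch that reuses the bids data frame and pop_var() helper from the snippet above; the 50% threshold is only a placeholder to be tuned against the model's target error):

    # Approach 1: if a group's % error exceeds the threshold, reject all
    # of its bids (i.e. drop the item entirely from the training data).
    threshold <- 50

    item_cols <- setdiff(names(bids), "total")
    err_items <- sapply(bids[item_cols], function(x) 100 * sqrt(pop_var(x)) / mean(x))

    keep <- names(err_items)[err_items <= threshold]
    bids_clean <- bids[, c("total", keep)]

    names(bids_clean)  # with the example data, items 3-5 are dropped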

Joel Graff
  • a) You shouldn't use the mean and the variance around it to find outliers: often it won't [work](http://stats.stackexchange.com/a/121075/603). I show it in that answer for the LOO mean and variance, but the same argument applies to the plain-vanilla mean and variance. b) You shouldn't use univariate techniques for [multivariate outlier detection](http://stats.stackexchange.com/a/50780/603): often it won't work because multivariate outliers need not stand out on any one axis to be seriously harmful. – user603 Apr 03 '15 at 15:09
  • On the second point, I had considered that possibility, which is why I'd resolved to be as conservative as I could in removing outliers, knowing that what appears to be a big error may resolve itself once the model runs. I had the thought of perhaps applying this technique after selecting outliers by quantile. That way, I know I'm removing "relative extremes" by variance (which may not be extremes after all) from a list of "absolute extremes" selected by quantile, if that makes sense. It seems like combining the techniques might mitigate the data quality issues that each introduces... – Joel Graff Apr 03 '15 at 16:00
  • I don't fully understand your comment, but I discourage you from reinventing the [wheel](http://stats.stackexchange.com/q/213/603), or at any rate from doing so before reading a bit on the [subject](http://www.amazon.com/Robust-Statistics-Ricardo-A-Maronna/dp/0470010924). – user603 Apr 03 '15 at 16:02
  • I edited my comment but ran out of time... Your second point deals with outliers I wasn't concerned about. I'm only going after these extremes because I know they demonstrably affect model performance; I just don't want to be overzealous in removing them. Yes, I know that univariate techniques aren't appropriate for multivariate problems, but this is a somewhat special case... Anyway, I'll do a little more research on the links you provided. – Joel Graff Apr 03 '15 at 16:09
  • Are you sure that by selecting the outliers one variable at a time you are being conservative? One can show that if the data is multivariate and contaminated, doing so would lead you to flag the wrong observations as outliers. However, I do not recall having seen an argument to the effect that the univariate approach would flag fewer observations as outliers than the multivariate one. In many situations I know that the opposite holds (the univariate approach would flag *more* observations as outliers). – user603 Apr 03 '15 at 18:18
  • I certainly agree - a univariate analysis would naturally choose across multiple dimensions of multivariate data and thus be less discriminating in its selection. To the point, though, I'm not using any of the variables that inform my model to select my outliers here. I'm essentially selecting outliers based on the magnitude of the variance of the dependent variables alone. Knowing that the magnitude of the "useless" data is generally *very* large (100% or more), I would leave in anything less. In one case, this appears to select maybe 2 or 3 points out of 8,400. – Joel Graff Apr 03 '15 at 18:36
  • I hope I don't come off as brash or inconsiderate here. I just want to point out (as is done in the book I directed you to, and many others besides, as well as in a couple of posts here) that it is possible for rules like the one you propose to end up excluding a few good observations while retaining as many or more nasty outliers. This is because outliers can pull a non-robustly fitted model so far towards themselves that they appear well fitted by it. In fact, it is quite easy to construct such examples with synthetic data, and, more to the point, many real data sets are like that too... – user603 Apr 03 '15 at 19:48
  • ...There exists a principled way to think about this problem. Solutions to your problem exist; they are widely implemented and studied, explained in textbooks, and they come with formal guarantees. It would be wise to consider them. There are worse ways to start than searching this forum for multivariate outlier detection methods. You could have a look [here](http://www.geo.upm.es/postgrado/CarlosLopez/papers/FastAlgMCD99.pdf) and [here](http://cran.r-project.org/web/packages/rrcov/index.html) too. – user603 Apr 03 '15 at 19:59
  • Thanks for the input. That's why I made the post in the first place. I'll definitely take a closer look at the links - if I can solve my problem and address outliers that would otherwise be difficult to detect, that's certainly better. – Joel Graff Apr 03 '15 at 21:53

1 Answer


After taking some time to investigate the topic of robust statistics more thoroughly, I've opted for a better, although not ideal, way to select my outliers.

This post identifies the Median Absolute Deviation (MAD) as an excellent robust alternative to the mean and variance. Of course, its application is univariate, while my use case is multivariate.
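
For the univariate case, that rule amounts to flagging points that sit too many MADs from the median. A minimal sketch in base R (mad() already includes the 1.4826 consistency constant; the cutoff of 3 is just a common convention, not something taken from that post):

    # Flag points whose distance from the median exceeds `cutoff` MADs
    mad_outliers <- function(x, cutoff = 3) {
      abs(x - median(x)) / mad(x) > cutoff
    }

    mad_outliers(c(800, 80, 25))  # Item 3 from the question: flags the 800 bid
    # [1]  TRUE FALSE FALSE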

Thus, I dug a little deeper and discovered the rrcov package in R. It works nicely, providing distance plots for multivariate data. It identified the points I knew a priori to be outliers, as well as a good number of points that I could not otherwise have identified.

The implementation I'm opting for, then, is to perform this multivariate distance analysis using the tools provided by the rrcov library. Having reduced the data to a series of robust distances from a robust center estimate, I can then apply a univariate technique like the MAD to select the least variant (or "nearest") data points.
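
Roughly, the pipeline looks like this (a sketch only: X is a placeholder for the numeric feature matrix, and it assumes rrcov's MCD estimator via CovMcd(), with getDistance() returning the robust Mahalanobis distances and plot(..., which = "dd") producing the distance-distance plot):

    library(rrcov)  # robust multivariate location/scatter estimators

    # X: numeric matrix or data frame of features, one row per observation
    # (placeholder name -- substitute the real feature set)
    mcd <- CovMcd(X)

    # Robust distances of each observation from the robust center
    d <- getDistance(mcd)

    # Distance-distance plot: robust vs. classical Mahalanobis distances
    plot(mcd, which = "dd")

    # Keep the "nearest" 95% of observations (a fixed percentage rather
    # than a chi-squared cut-off; see caveat 2 below)
    keep <- d <= quantile(d, 0.95)
    X_clean <- X[keep, , drop = FALSE]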

A few caveats that I have identified:

  1. The ability to detect outliers is limited by how well the features of the data set describe the dependent variable. Thus, I can't assume that, just because I've used a multivariate technique, I'm detecting "true" outliers. Conversely, as I discover other features that better describe the dependent variable, I can expect my outlier detection to improve as well.
  2. While robust statistics like the MAD tolerate the use of cut-off rules, those rules require assumptions about the data that don't hold in my case. Thus, I've opted for simply keeping a fixed percentage of the "nearest" data points (e.g. 95%, as in the sketch above), noting that the remainder left out will be the most variant for the data features I've chosen.
  3. The MAD assumes a symmetric data distribution, which I unfortunately don't have. As a result, I'm using the Sn estimator (see Wikipedia), which is based on pairwise differences between data points and is therefore less susceptible to asymmetric distributions (see the sketch after this list).
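
Swapping the MAD for the Sn estimator is then a one-line change (a sketch, assuming the estimator in caveat 3 is the Sn estimator of Rousseeuw and Croux, available as Sn() in the robustbase package that rrcov builds on):

    library(robustbase)  # provides the Sn() and Qn() scale estimators

    x <- c(22, 25, 27, 31, 40, 55, 90, 800)  # hypothetical, skewed sample

    mad(x)  # MAD: implicitly assumes a roughly symmetric distribution
    Sn(x)   # Sn: based on pairwise absolute differences, no symmetry assumption

    # e.g. standardize the robust distances with Sn instead of the MAD:
    # abs(d - median(d)) / Sn(d)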

This question (and user603's fantastic insights) has given me a much better understanding of robust statistical methods, and I am certainly more comfortable with the idea of removing outliers. It may be true that every point is "there for a reason", but until I can adequately describe that reason (as relevant dataset features), it simply isn't a useful data point.

Joel Graff
  • Removing outliers is the most effective way of manipulating the data to obtain the desired results. – Aksakal Apr 07 '15 at 19:37