I'm trying to remove what might be considered "unreasonable" data by evaluating the percent error in the mean and square root of the variance. Here's the setup:
Let's say I have three bids on a contract. The contractors' total bids are all relatively close, but the itemized breakdowns of those bids can vary enormously from one bidder to the next.
For example:
#       Total Bid   Item 1   Item 2   Item 3   Item 4   Item 5
-----   ---------   ------   ------   ------   ------   ------
1         827,558    1,026     27.7      800    1,000    1,998
2         667,118      950       25       80    3,000       23
3         720,909    1,100       25       25    1,100     22.4
-----   ---------   ------   ------   ------   ------   ------
err %        9.03     5.97     4.91      117     54.1   136.78
The "err" is the square root of the variance expressed as a percentage of the mean of each column (the coefficient of variation, using the population variance, i.e. dividing by n), calculated as:
(var^(1/2) / mean) * 100
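As a sanity check, here is a minimal Python sketch that recomputes the "err" row. It assumes err is the population standard deviation (dividing by n, not n - 1) divided by the mean, times 100; that reading reproduces the table to within about a tenth of a percentage point. The names `bids` and `percent_err` are only illustrative.

```python
import statistics

# Bids from the table above: one list per column, in bidder order 1..3.
bids = {
    "Total Bid": [827_558, 667_118, 720_909],
    "Item 1": [1_026, 950, 1_100],
    "Item 2": [27.7, 25, 25],
    "Item 3": [800, 80, 25],
    "Item 4": [1_000, 3_000, 1_100],
    "Item 5": [1_998, 23, 22.4],
}


def percent_err(values):
    """Population standard deviation as a percentage of the mean."""
    mean = statistics.fmean(values)
    # Assumption: pstdev (divide by n) is used because it matches the table.
    return statistics.pstdev(values) / mean * 100


for name, values in bids.items():
    print(f"{name:>9}: {percent_err(values):7.2f}")
```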
This metric does a great job of representing the problem that I think I need to address. For example, the % error of Items 1 and 2 shows that the bidders bid pretty consistently on them. It also indicates that those item bids were more consistent than the overall bid totals (error 9.03%).
By contrast, Items 3 - 5 show a higher degree of inconsistency, ranging from 54% to over 136%.
Here's what I know about the data a priori:
The high bids of Item 3 and Item 5 are garbage. By that I mean there's no real way to have anticipated those bids. It's just the bidder playing games with how they itemize their bids (really high on one item, really low on another) to mitigate extra costs if they get awarded the contract. In both Items 3 and 5, the lower bids are far closer to the value of the work.
Item 4 has a more ambiguous distribution. It could be that the lower bids represent the value of the work more accurately (and likely they do), but it may also be that the value is higher here than it seems. I'd be reluctant to throw out the high bid and might instead consider a weighted average as the real value of the work.
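To make the weighted-average idea for Item 4 concrete, here is one possible sketch. The weighting scheme is purely an assumption for illustration, not something the data dictates: each bid is down-weighted by its distance from the median, scaled by the median absolute deviation (MAD). The function name `robust_weighted_mean` is hypothetical.

```python
import statistics


def robust_weighted_mean(values):
    """Weighted mean that down-weights bids far from the median.

    Hypothetical scheme: weight = 1 / (1 + |bid - median| / MAD).
    """
    median = statistics.median(values)
    # Median absolute deviation: a scale estimate one extreme bid cannot inflate.
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        # Most bids coincide with the median, so just use the median.
        return median
    weights = [1 / (1 + abs(v - median) / mad) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


item_4 = [1_000, 3_000, 1_100]
print(round(robust_weighted_mean(item_4), 2))
```

For Item 4 this lands near 1,129, compared with a plain mean of 1,700, so the high bid still contributes but no longer dominates.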
I should also point out that I'm using this data with a neural network. Ideally, the model's prediction error would be 15% or less.
So, in order to treat this as conservatively as possible, keeping outliers that might reasonably contribute to the model while throwing out ones that are obviously useless, I've considered a couple of approaches:
1. Reject all bids for an item if the item's % error exceeds a set threshold.
2. Reject only the most variant bids when % error exceeds the threshold.
It seems to me the best approach might be #1, using a threshold that scales with the desired error of the model...
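The two approaches can be sketched in Python as follows. Several things here are assumptions rather than part of the setup: the threshold of 15 simply borrows the target model error, "most variant" is taken to mean "farthest from the median", and the `min_bids` floor is a hypothetical safeguard so an item is never reduced to a single bid.

```python
import statistics


def percent_err(values):
    """Population standard deviation as a percentage of the mean."""
    return statistics.pstdev(values) / statistics.fmean(values) * 100


def reject_all(values, threshold):
    """Approach 1: keep an item's bids only if its err is within the threshold."""
    return list(values) if percent_err(values) <= threshold else []


def reject_most_variant(values, threshold, min_bids=2):
    """Approach 2: drop the bid farthest from the median while err is too high.

    Stops once min_bids remain, so the result may still exceed the threshold.
    """
    kept = list(values)
    while len(kept) > min_bids and percent_err(kept) > threshold:
        median = statistics.median(kept)
        kept.remove(max(kept, key=lambda v: abs(v - median)))
    return kept


threshold = 15.0  # assumed: tied to the desired model error
items = {
    "Item 1": [1_026, 950, 1_100],
    "Item 3": [800, 80, 25],
    "Item 4": [1_000, 3_000, 1_100],
    "Item 5": [1_998, 23, 22.4],
}
for name, values in items.items():
    print(name, reject_all(values, threshold), reject_most_variant(values, threshold))
```

Note that approach 2 leaves Item 3 with 80 and 25, which still disagree by far more than 15%, so it would need a follow-up rule for items like that.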