How to properly ignore results with high variance

Question

I'm trying to estimate performance results of different configurations. In each test one machine is generating requests to a server for x minutes.

The output is:

1. Number of attempts
2. Number of successful requests
3. Time of each request

My problem is that (as an example) most of the requests take about a second or less and then there are a few requests that take 120 seconds.

I need to produce clear and simple output graphs. So:

A. Is there a proper way (formula) to "ignore" results that are larger than x? I can simply omit some results from the average but was wondering if there is a more elegant way to add it to a formula.

EDIT: Deleted the second question.

You have two big and different questions bundled together. B asks for how to determine "best", but without knowing the trade-off between success and speed, there are no obvious simple answers. A would usually be regarded as the problem of how to deal with **_outliers_**. The title word "variance" is not quite right, as the problem is how to deal with some very high values. I've added a tag `outliers` but you should look at some of the highest voted threads on that. You don't seem to have a new or different problem there. — Nick Cox, Apr 13 '18 at 10:59
For the A part of your question, I would bin the data (say, in 30-minute intervals). Start with a coarse model for the non-outlying values of 'Time of Request', such as an exponential distribution (possibly allowing for a nonzero shift parameter). Have a look at how to detect outliers in that setting, for example in [these answers](https://stats.stackexchange.com/questions/129274/outlier-detection-on-skewed-distributions/129297#129297). Hopefully, that will help you identify the extreme measurements. — user603, Apr 13 '18 at 11:40
I have changed the title to highlight the time series as well as the outlier detection aspects of your question. Have a look at detection of outliers in a time series context here. Feel free to change it back! — user603, Apr 13 '18 at 11:46
@user603 I don't see emphasis on time series here. Edits should just be minor. Naturally I agree that the full problem might entail also looking at dependence in time, but that's a matter for comment rather than rewording the title on the OP's behalf. — Nick Cox, Apr 13 '18 at 11:55
Post-edit: Have you tried specifically the approach in [this](https://stats.stackexchange.com/a/129277/603) answer by Glen_b? — user603, Apr 13 '18 at 21:01
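
The suggestions in the comments can be illustrated with a minimal Python sketch. It is not taken from any of the linked answers. It assumes the request times are available as a plain list of seconds, and the names `robust_summary`, `flag_outliers`, `trim_fraction` and `tail_prob`, as well as the sample data, are hypothetical. The first function reports summaries that a handful of 120-second requests barely affect (median, trimmed mean). The second fits a coarse shifted exponential model, estimating the shift by the minimum and the scale from the median, and flags values beyond a high quantile of that fit. The estimators and the 0.001 tail probability are illustrative choices, not a recommendation from the thread.

```python
import math
import statistics


def robust_summary(times, trim_fraction=0.05):
    """Return summaries that a few very large values barely affect."""
    ordered = sorted(times)
    n = len(ordered)
    k = int(n * trim_fraction)  # number of values dropped from EACH tail
    trimmed = ordered[k:n - k] if k > 0 else ordered
    return {
        "mean": statistics.fmean(ordered),        # sensitive to outliers
        "median": statistics.median(ordered),     # robust
        "trimmed_mean": statistics.fmean(trimmed),
    }


def flag_outliers(times, tail_prob=0.001):
    """Flag values that are implausible under a shifted exponential fit.

    Assumption: the bulk of the data is roughly shift + Exponential(scale).
    The shift is estimated by the minimum and the scale from the median,
    using median = shift + scale * ln(2), so extreme values hardly move
    the fit. Values above the (1 - tail_prob) quantile are flagged.
    Degenerate case (median == minimum) is not handled here.
    """
    shift = min(times)
    scale = (statistics.median(times) - shift) / math.log(2)
    cutoff = shift - scale * math.log(tail_prob)
    return cutoff, [t for t in times if t > cutoff]


# Hypothetical request times in seconds: mostly about a second, one at 120 s.
times = [0.4, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 0.5, 0.7, 120.0]

summary = robust_summary(times, trim_fraction=0.1)
cutoff, outliers = flag_outliers(times)

print(summary)
print(f"cutoff = {cutoff:.2f} s, flagged = {outliers}")
```

For a graph, reporting the median (or a trimmed mean) together with the count of flagged requests keeps the slow requests visible without letting them dominate the average. To follow the binning suggestion, the same functions could be applied separately to each time interval.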
