I’m working on improving the robustness of a software engineering process that measures the performance of a programming language compiler and standard library, a.k.a. benchmarks (in the computing sense of the word). The old process used the mean over 1-second periods, probably to deal with outliers that you can see in the raw data:
These outliers are caused by uncontrolled, varying system load.
My aim is to improve the process to be robust in the presence of varying system load. This question is about using statistics to improve the quality of measurements that will then be used for hypothesis testing (did the performance change?).
There are multiple sources of measurement error (hardware interrupts, context switches, etc.) that inflate the results. With the old process (i.e. averaging per second), these errors were causing frequent false reports of change.
Since the distribution of the errors, and therefore of the resulting data, is clearly not normal, I’ve concluded that the mean was the wrong statistic to use for characterizing the measured values.
Complete benchmark suite measurements:
Now I’m a bit stuck answering questions about the typical value and its uncertainty. My initial take was to use the Median and Interquartile Range. This works much better than the Mean and SD in the presence of outliers, but it still varies quite a lot with the system load. I haven’t looked at the Mid-Mean and Trimmed Mean in much detail yet.
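To illustrate what I mean, here is a minimal sketch (Python/NumPy, with simulated data rather than real benchmark results) comparing the two pairs of statistics on a sample contaminated by a long right tail:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated benchmark timings in nanoseconds: a tight main body plus a few
# load-induced outliers that inflate the right tail.
body = rng.normal(loc=1000, scale=5, size=990)
outliers = rng.normal(loc=1500, scale=100, size=10)
sample = np.concatenate([body, outliers])

mean, sd = sample.mean(), sample.std(ddof=1)
median = np.median(sample)
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

print(f"Mean   = {mean:7.1f}   SD  = {sd:5.1f}")     # pulled up by the tail
print(f"Median = {median:7.1f}   IQR = {iqr:5.1f}")  # barely affected
```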
Looking at histograms, the measurements with a lot of samples have a clearly defined Mode (sometimes more than one) that appears to be very robust to the varying system load. But I haven’t found solid advice on how to pick the mode algorithmically, or how to properly characterize the uncertainty of the mode. I’ve read some notes about picking the right bin size and using that as the uncertainty.
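The best idea I’ve come up with so far is the sketch below: take the centre of the tallest histogram bin (Freedman–Diaconis bin width) as the mode, and bootstrap it to get a rough interval. I’m not at all sure this is sound, and it obviously returns only one mode even when the histogram has several:

```python
import numpy as np

def histogram_mode(sample, n_boot=1000, seed=0):
    """Estimate the mode as the centre of the tallest histogram bin
    (Freedman-Diaconis bin width) and bootstrap a rough 95% interval."""
    def one_mode(x):
        counts, edges = np.histogram(x, bins="fd")
        i = np.argmax(counts)
        return 0.5 * (edges[i] + edges[i + 1])

    rng = np.random.default_rng(seed)
    point = one_mode(sample)
    boot = np.array([
        one_mode(rng.choice(sample, size=len(sample), replace=True))
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return point, (lo, hi)
```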
Some colleagues have argued for using the Minimum, which has some precedent in the literature and appears to be robust in most cases. But I have found a few benchmarks in our test suite where its use would be quite problematic. I also don’t know how to characterize the uncertainty if I just pick the most extreme value from a sample...
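The only rough idea I have for that is to look at how the per-run minima scatter across independent runs, something like this sketch (not a method I’ve validated):

```python
import numpy as np

def minimum_with_spread(runs):
    """runs: list of 1-D arrays, one per independent measurement run.
    Returns the overall Minimum plus the spread of the per-run minima,
    as a crude indication of how repeatable the Minimum is."""
    per_run_min = np.array([np.min(r) for r in runs])
    spread = per_run_min.max() - per_run_min.min()
    return per_run_min.min(), spread
```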
Most recently I’ve modified the measurement process to report quantiles from the sample (Type R-3, SAS-2). I’m thinking about combining ventiles (20-quantiles) from several independent measurements to better characterize the sample distribution, and using that as the input for the next step of hypothesis testing. Would that be a more reasonable approach than picking a single typical value and computing its uncertainty, given that our sample distributions are non-normal, have very long tails, and can sometimes even be multi-modal?
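Concretely, the combination step I have in mind looks roughly like this (Python/NumPy ≥ 1.22; reading `closest_observation` as the NumPy name for the R-3 / SAS-2 definition is my interpretation of the docs):

```python
import numpy as np

PROBS = np.arange(1, 20) / 20.0   # ventile probabilities 0.05, 0.10, ..., 0.95

def ventiles(run):
    """Per-run ventiles; 'closest_observation' is NumPy's name for the
    H&F type 3 estimator, which I believe matches R-3 / SAS-2."""
    return np.quantile(run, PROBS, method="closest_observation")

def combined_ventiles(runs):
    """Combine several independent runs by taking, for each ventile,
    the median of that ventile across the runs."""
    per_run = np.vstack([ventiles(r) for r in runs])
    return np.median(per_run, axis=0)
```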
During exploratory data analysis of the raw measurements, guided by the NIST Engineering Statistics Handbook, many graphical techniques were hindered by the presence of extreme outliers, which dominated the charts. I’ve adapted the technique from the box plot and excluded outliers above the top inner fence, TIF = Q3 + 1.5 * IQR. This helped with scaling the charts so that the main body of the signal became visible.
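In code, the filtering step is just this (a sketch, assuming the measurements are in a 1-D NumPy array):

```python
import numpy as np

def drop_above_top_inner_fence(sample):
    """Remove extreme outliers above the box-plot top inner fence,
    TIF = Q3 + 1.5 * IQR. `sample` is a 1-D NumPy array of timings."""
    q1, q3 = np.percentile(sample, [25, 75])
    tif = q3 + 1.5 * (q3 - q1)
    return sample[sample <= tif]
```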
Given that our data set consists of repeated measurements of the same benchmark program, I believe it’s justified and necessary to remove the extreme outliers, based on the following advice in the NIST Handbook, chapter on statistical control of the measurement process (emphasis mine):
Causes that do not warrant corrective action (but which *do require that the current measurement be discarded*) are:
- Chance failure where the precision is actually in control
- Glitch in setting up or operating the measurement process
- Error in recording of data
If I use the above technique to preprocess the data, should I take this noise-filtering step into account when expressing confidence intervals for the statistics computed on the cleaned dataset?