I’m working on improving the robustness of a software engineering process that measures the performance of a programming language compiler and standard library, a.k.a. benchmarks (in the computing sense of the word). The old process used the mean over 1-second periods, probably to deal with outliers that you can see in the raw data:
These outliers are caused by uncontrolled, varying system load.
My aim is to improve the process to be robust in the presence of varying system load. This question is about using statistics to improve the quality of measurements that will then be used for hypothesis testing (did the performance change?).
There are multiple sources of measurement error (hardware interrupts, context switches, etc.) that inflate the results. With the old process (i.e. averaging per second), these errors were causing frequent false reports of change.
Since the distribution of the errors, and therefore of the resulting data, is clearly not normal, I’ve concluded that the mean was the wrong statistic to use for characterizing the measured values.
Complete benchmark suite measurements:
Now I’m a bit stuck answering questions about the typical value and its uncertainty. My initial take was to use the Median and Interquartile Range. This works much better than the Mean and SD in the presence of outliers, but it still varies quite a lot with the system load. I haven’t looked at the Mid-Mean and Trimmed Mean in much detail yet.
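To illustrate what I mean, here is a minimal sketch (Python/NumPy, with simulated data rather than real benchmark results) comparing the two pairs of statistics on a sample contaminated by a long right tail:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated benchmark timings in nanoseconds: a tight main body plus a few
# load-induced outliers that inflate the right tail.
body = rng.normal(loc=1000, scale=5, size=990)
outliers = rng.normal(loc=1500, scale=100, size=10)
sample = np.concatenate([body, outliers])

mean, sd = sample.mean(), sample.std(ddof=1)
median = np.median(sample)
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

print(f"Mean   = {mean:7.1f}   SD  = {sd:5.1f}")     # pulled up by the tail
print(f"Median = {median:7.1f}   IQR = {iqr:5.1f}")  # barely affected
```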
Looking at histograms, the measurements with a lot of samples have a clearly defined Mode (sometimes more than one) that appears to be very robust to the varying system load. But I haven’t found solid advice on how to pick the mode algorithmically, or how to properly characterize the uncertainty of the mode. I’ve read some notes about picking the right bin size and using that as the uncertainty.
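The best idea I’ve come up with so far is the sketch below: take the centre of the tallest histogram bin (Freedman–Diaconis bin width) as the mode, and bootstrap it to get a rough interval. I’m not at all sure this is sound, and it obviously returns only one mode even when the histogram has several:

```python
import numpy as np

def histogram_mode(sample, n_boot=1000, seed=0):
    """Estimate the mode as the centre of the tallest histogram bin
    (Freedman-Diaconis bin width) and bootstrap a rough 95% interval."""
    def one_mode(x):
        counts, edges = np.histogram(x, bins="fd")
        i = np.argmax(counts)
        return 0.5 * (edges[i] + edges[i + 1])

    rng = np.random.default_rng(seed)
    point = one_mode(sample)
    boot = np.array([
        one_mode(rng.choice(sample, size=len(sample), replace=True))
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return point, (lo, hi)
```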
Some colleagues have argued for using the Minimum, which has some precedent in the literature and appears to be robust in most cases. But I have found a few benchmarks in our test suite where its use would be quite problematic. I also don’t know how to characterize the uncertainty if I just pick the most extreme value from a sample...
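The only rough idea I have for that is to look at how the per-run minima scatter across independent runs, something like this sketch (not a method I’ve validated):

```python
import numpy as np

def minimum_with_spread(runs):
    """runs: list of 1-D arrays, one per independent measurement run.
    Returns the overall Minimum plus the spread of the per-run minima,
    as a crude indication of how repeatable the Minimum is."""
    per_run_min = np.array([np.min(r) for r in runs])
    spread = per_run_min.max() - per_run_min.min()
    return per_run_min.min(), spread
```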
Most recently I’ve modified the measurement process to report quantiles from the sample (Type R-3, SAS-2). I’m thinking about combining ventiles (20-quantiles) from several independent measurements to better characterize the sample distribution, and using that as the input for the next step of hypothesis testing. Would that be a more reasonable approach than picking a single typical value and computing its uncertainty, given that our sample distributions are non-normal, have very long tails, and can sometimes even be multi-modal?
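Concretely, the combination step I have in mind looks roughly like this (Python/NumPy ≥ 1.22; reading `closest_observation` as the NumPy name for the R-3 / SAS-2 definition is my interpretation of the docs):

```python
import numpy as np

PROBS = np.arange(1, 20) / 20.0   # ventile probabilities 0.05, 0.10, ..., 0.95

def ventiles(run):
    """Per-run ventiles; 'closest_observation' is NumPy's name for the
    H&F type 3 estimator, which I believe matches R-3 / SAS-2."""
    return np.quantile(run, PROBS, method="closest_observation")

def combined_ventiles(runs):
    """Combine several independent runs by taking, for each ventile,
    the median of that ventile across the runs."""
    per_run = np.vstack([ventiles(r) for r in runs])
    return np.median(per_run, axis=0)
```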
During exploratory data analysis of the raw measurements, guided by the NIST Engineering Statistics Handbook, many graphical techniques were hindered by the presence of extreme outliers, which dominated the charts. I’ve adapted the technique from the box plot and excluded outliers above the top inner fence, TIF = Q3 + 1.5 * IQR. This helped with scaling the charts so that the main body of the signal became visible.
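In code, the filtering step is just this (a sketch, assuming the measurements are in a 1-D NumPy array):

```python
import numpy as np

def drop_above_top_inner_fence(sample):
    """Remove extreme outliers above the box-plot top inner fence,
    TIF = Q3 + 1.5 * IQR. `sample` is a 1-D NumPy array of timings."""
    q1, q3 = np.percentile(sample, [25, 75])
    tif = q3 + 1.5 * (q3 - q1)
    return sample[sample <= tif]
```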
Given that our data set consists of repeated measurements of the same benchmark program, I believe it’s justified and necessary to remove the extreme outliers, based on the following advice in the NIST Handbook, chapter on statistical control of the measurement process (emphasis mine):
Causes that do not warrant corrective action (but which *do require that the current measurement be discarded*) are:
- Chance failure where the precision is actually in control
- Glitch in setting up or operating the measurement process
- Error in recording of data
If I use the above technique to preprocess the data, should I take this noise-filtering step into account when expressing confidence intervals for the statistics computed on the cleaned dataset?