
I have multiple samples, each consisting of response times of a system. I want to test whether any sample differs significantly from the others (primarily in the expected value). For two-sample testing I'm using the sign test, and for testing multiple samples at once I'm using the Friedman test.
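
For concreteness, this is roughly how I apply the tests (a minimal sketch using SciPy; the arrays here are only placeholders for the real measurement vectors):

```python
# Minimal sketch: paired sign test and Friedman test on response-time samples.
# `a`, `b`, `c` are placeholders standing in for the real per-probe timings (ns).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(68000, 1000, size=10000)
b = rng.normal(68000, 1000, size=10000)
c = rng.normal(68000, 1000, size=10000)

# Sign test: among non-tied pairs, is P(a > b) = 0.5?
diff = a - b
n_pos = int(np.sum(diff > 0))
n_nonzero = int(np.sum(diff != 0))
sign_p = stats.binomtest(n_pos, n_nonzero, p=0.5).pvalue

# Friedman test across the three inputs, treating each tuple as a block.
friedman_stat, friedman_p = stats.friedmanchisquare(a, b, c)
print(sign_p, friedman_p)
```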

Unfortunately, the samples have non-independent noise (verified with Hoeffding's test, p-value < 1e-8). In practice that means that for samples with over 10000 observations, the sign test and Friedman test show statistically significant differences (p-value < 1e-6) for samples that were measurements of exactly the same input.

What is the recommended practice for dealing with non-independent, non-uniform, multimodal, heteroskedastic noise in repeated-measurements data?

Data acquisition

The measurements are performed as follows (a minimal sketch of steps 1 and 2 is shown after the list):

  1. First create a list of randomly ordered tuples of tests to perform (e.g., given three inputs, A, B, and C, it could be something like ABC, CBA, CAB, BCA, etc.)
  2. Run the tests in that random order (e.g. send input A, wait for reply, send B, wait for reply, send C, wait for reply, send C, wait for reply, send B, wait for reply, etc.)
    • The inputs are sent by a Python application over a regular TCP connection, the "protocol" is just connect, send query, wait for response, close connection
  3. Have a system running in the background that monitors the communication, noting the time between query and response and saving the times as a list (continuing the example: 68028 ns, 69667 ns, 67971 ns, 68535 ns, 69458 ns, 67767 ns, 68335 ns, ...)
    • This is done by tcpdump
  4. Combine the knowledge of the test ordering with the noted times to get measurements for specific tests (continuing the example, for A I then get 68028 ns, 67767 ns, 67822 ns, ..., for B I get 69667 ns, 69458 ns, 68314 ns, ..., and for C I get 67971 ns, 68535 ns, 68335 ns, ...)
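
A stripped-down sketch of steps 1 and 2 (the host, port, and payloads are placeholders; the real timings come from the tcpdump capture, not from this script):

```python
# Sketch of steps 1 and 2: randomised probe ordering and the query/response loop.
# HOST, PORT and the payloads are placeholders; the actual timing is taken from
# the tcpdump capture, not from this script.
import random
import socket

HOST, PORT = "192.0.2.1", 4433
INPUTS = {"A": b"query-A", "B": b"query-B", "C": b"query-C"}

def send_probe(payload):
    """Connect, send one query, wait for the response, close the connection."""
    with socket.create_connection((HOST, PORT)) as sock:
        sock.sendall(payload)
        return sock.recv(4096)  # first chunk is enough for this sketch

def run_measurement(repeats):
    """Return the executed ordering so the tcpdump timings can be matched back."""
    order = []
    for _ in range(repeats):
        tup = random.sample(list(INPUTS), k=len(INPUTS))  # e.g. ['C', 'A', 'B']
        for name in tup:
            send_probe(INPUTS[name])
            order.append(name)
    return order
```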

Data example

Example scatter plot of a pair of samples (axes are in seconds): Plot of A vs B

Hubert Kario
  • If you have very large samples almost any difference will achieve some level of statistical significance. That is the curse of using p-values as opposed to effect sizes with a confidence interval. – mdewey Oct 22 '21 at 10:36
  • @mdewey 1. any difference is precisely what I'm after, I want to detect differences down to single nanoseconds while the median absolute deviation I see is in the order of 1 µs. 2. I get the exact same issue if I use bootstrapping to get confidence intervals for median, this also shows that there are differences for samples that shouldn't have them. – Hubert Kario Oct 22 '21 at 11:31
  • Could you please clarify what "no sample is significantly different" means? "Significantly different" is a *comparison,* not an absolute quality of a sample, and it also begs the question of *what properties* you are comparing. (After all, *every* sample differs from any other non-identical sample in some way.) One of the best ways to explain yourself would be to tell us what your null and alternative hypotheses are. Also, could you explain the basis for characterizing your noise as "multimodal, heteroskedastic"? The *degree* to which these hold is more important than whether they do. – whuber Oct 22 '21 at 21:42
  • @whuber like I wrote, the expected value, I'm ok with more or less any measure: median, stochastic dominance, etc. By multimodal I mean that it has at least 3 significant modes, separated by multiple (more than 4) standard deviations of those modes. heteroskedasticity is similar, in that the variance of the observations increases by a factor of 5. – Hubert Kario Oct 23 '21 at 17:21
  • With such a large number of observations that multimodality is of no concern. Heteroscedasticity matters more -- but the form of the dependence ought to be the predominant issue to address. A clearer description of your data would be helpful for suggesting appropriate measures to take. – whuber Oct 23 '21 at 17:27
  • @whuber I'm stuck on how to describe the dependence. I have no ideas where it comes from, and I know of no standard statistical kinds of dependence. The data is more clearly described in my other question: https://stats.stackexchange.com/q/548953/289885 but if there's anything unclear I can explain (up to an including giving a fully open source reproducer and instructions on how to run it). – Hubert Kario Oct 23 '21 at 17:51
  • What is non-independent noise? Does it mean that the noise is not the same distribution for the measurements of the two samples? – Sextus Empiricus Oct 24 '21 at 00:26
  • I am unable to reconcile the plot, which is *bivariate,* with the statement of the problem, which--although vague--sounds *univariate.* Could you explain what you mean by "samples" and "noise"? – whuber Oct 24 '21 at 13:40
  • @whuber The plot is representing the two samples compared against each other, i.e. what the sign test "sees". A "sample" is all the measurements of processing time of the system under test. "Noise" is all the variability in the measurement I see: if I'm comparing the time to process an input against the time to process the same input, I'd expect to see a bunch of zeros. When comparing it to some different input I'd expect to see some non-zero value repeated a bunch of times. – Hubert Kario Oct 24 '21 at 23:25
  • You describe a "sample" as consisting of "processing time." Time is univariate. How, then, do you obtain the pairs of data needed to construct a scatterplot? How do you separate noise from the signal? Are you including *all* variation as "noise"? – whuber Oct 25 '21 at 13:26
  • @whuber I'm testing 3 inputs at a time, but in random order; if two of those are the same, I get pairs of values that have been tested at a similar time, even if I repeat the test of the tuples a few thousand times. The graph is those pairs of measurements. – Hubert Kario Oct 25 '21 at 15:30
  • @whuber If I'm measuring the same thing in both samples, and then calculating the difference between those measurements, isn't any departure from 0 noise? – Hubert Kario Oct 25 '21 at 15:35
  • Perhaps, but it's unclear what you mean by "testing 3 inputs at a time." One thing that is becoming evident is your situation is more complex than stated in the question. It really would help to have a fuller description of what you're doing. – whuber Oct 25 '21 at 15:44
  • @whuber added section on data acquisition, is that sufficient, or would you like to have more details? – Hubert Kario Oct 27 '21 at 10:09
  • Relevant to [@mdewey](https://stats.stackexchange.com/questions/108911/why-does-frequentist-hypothesis-testing-become-biased-towards-rejecting-the-null/108914#108914)'s comment. – Alexis Oct 28 '21 at 16:51
  • @Alexis As I wrote, "I get the exact same issue if I use bootstrapping to get confidence intervals for median". The problem is not with the test, the problem is with the data. The effect size that the test detects is much bigger than what I want to be able to detect. – Hubert Kario Oct 28 '21 at 19:54

1 Answer


More data

I did an additional run with N=12800, where all the probes were sending exactly the same values but were generated by 2 different objects (objects of the same class, initialised with the same values, but separate instances).

The first 4 probes were generated by one object, while the last 3 were generated by a second object.

I've then bootstrapped confidence intervals for the calculated medians of differences:

Bootstrapped confidence intervals of median of differences
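
The bootstrap is a plain percentile bootstrap, roughly along these lines (a sketch; `a` and `b` stand for two matched samples):

```python
# Sketch: percentile bootstrap CI for the median of paired differences a - b.
import numpy as np

def bootstrap_median_ci(a, b, n_boot=5000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    diff = np.asarray(a) - np.asarray(b)
    medians = np.empty(n_boot)
    for i in range(n_boot):
        resample = rng.choice(diff, size=diff.size, replace=True)
        medians[i] = np.median(resample)
    low, high = np.quantile(medians, [alpha / 2, 1 - alpha / 2])
    return np.median(diff), low, high
```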

Reinterpreting the data from the question

Neither looking at the PACF of the differences between B and A:

PACF of differences B - A

nor running tests like the Wald-Wolfowitz runs test strongly points to a lack of independence of the differences between samples (p-value = 0.017).
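
Those checks were roughly along these lines (a sketch; the input file name is a placeholder for the vector of B - A differences):

```python
# Sketch: PACF of the B - A differences and a Wald-Wolfowitz runs test.
import numpy as np
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.sandbox.stats.runs import runstest_1samp

diff = np.loadtxt("differences_b_minus_a.txt")  # placeholder input file

# Partial autocorrelation of the difference series.
plot_pacf(diff, lags=40)

# Runs test around the median, checking independence of the signs.
z_stat, p_value = runstest_1samp(diff, cutoff="median", correction=True)
print(p_value)
```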

But if we look at the windowed median (window=1000, step=1):

windowed median of the B - A differences

It's fairly clear that the false positive in the sign test isn't caused by a single large excess or a periodic excess of one sample over the other, but rather by a systematic departure from a zero median.
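
The windowed median itself is just a rolling median over the differences (a sketch with pandas; the input file name is a placeholder):

```python
# Sketch: rolling (windowed) median of the B - A differences, window=1000, step=1.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

diff = np.loadtxt("differences_b_minus_a.txt")  # placeholder input file
rolling_median = pd.Series(diff).rolling(window=1000).median().dropna()

# Plot against the probe index to look for a systematic departure from zero.
ax = rolling_median.plot()
ax.axhline(0.0)
plt.show()
```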

Hypothesis

The false positives are caused by small but constant differences in how the different object instances are handled by the Python interpreter. This in turn affects the delay between when the connection is opened and when the data is sent over it, and that affects how quickly the server responds. In other words, it's the effect of PYTHONHASHSEED and/or ASLR.

Solution

The solution is to send the probes from freshly started processes, so that, on average, any effect from PYTHONHASHSEED and/or ASLR is the same for the different instances.
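
One way to implement that is roughly the following (a sketch; `probe_worker.py` is a hypothetical helper that sends a limited number of randomly ordered test tuples and then exits, so every batch runs in a freshly started interpreter with its own hash seed and address-space layout):

```python
# Sketch: run the test tuples in batches, each batch from a freshly started
# Python process, so PYTHONHASHSEED/ASLR get re-randomised between batches.
# `probe_worker.py` is a hypothetical helper that sends `--tuples` randomly
# ordered tuples of the inputs (as in the data-acquisition steps) and exits.
import subprocess
import sys

TUPLES_PER_PROCESS = 33  # 3 probes per tuple -> at most ~100 probes per process

def run_in_fresh_processes(total_tuples):
    remaining = total_tuples
    while remaining > 0:
        batch = min(TUPLES_PER_PROCESS, remaining)
        subprocess.run(
            [sys.executable, "probe_worker.py", "--tuples", str(batch)],
            check=True,
        )
        remaining -= batch
```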

Verification

I've thus modified the execution of the tests so that no single process is used to send more than 100 probes. I've then repeated the experiment a few dozen times with large sample sizes (>100k): not only are the false positives gone, the distribution of the p-values of those tests is also uniform.
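
Uniformity of the collected p-values can be checked, for example, with a one-sample Kolmogorov-Smirnov test (a sketch; the input file name is a placeholder):

```python
# Sketch: check whether the collected p-values are uniform on [0, 1].
import numpy as np
from scipy import stats

pvalues = np.loadtxt("sign_test_pvalues.txt")  # placeholder input file
ks_stat, ks_p = stats.kstest(pvalues, "uniform")
print(ks_stat, ks_p)
```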

Hubert Kario