
Given a set of items, each of which is measured once per condition by a computational model, what is the best way to quantify the magnitude of the differences between measurements taken under different conditions?

My goal is to show that there is no practical difference in the measurements when the conditions change, not merely to demonstrate a significant difference between one group mean and that of any other group at some chosen alpha and beta. Rejecting the null hypothesis of equal group means is in practice not always informative, especially with large sample sizes, skewed and heavy-tailed data, and small differences between group means and medians.

For example: although the measurement methods described below yield nearly identical measurements of the phenomenon in my practical case, a Friedman test rejects the null hypothesis (test statistic 2977.98, p-value << 0.001, using https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.friedmanchisquare.html), as does an RM ANOVA (F = 33.3, p-value << 0.01, using https://www.statsmodels.org/stable/generated/statsmodels.stats.anova.AnovaRM.html).
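
For reference, a minimal sketch of how these two tests can be run; the data here are simulated stand-ins (the column names, effect sizes, and noise levels are illustrative, not my actual data):

```python
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
from statsmodels.stats.anova import AnovaRM

# Simulated stand-in: 1000 images, each measured under 6 noise conditions,
# with tiny per-condition offsets so the tests flag a "significant" but
# practically negligible difference.
rng = np.random.default_rng(0)
n_images = 1000
conditions = [f"noise_{k}" for k in range(6)]
base = rng.normal(20, 5, n_images)  # per-image "true" score, e.g. % emphysema
df = pd.DataFrame(
    [{"image_id": i, "condition": c, "score": base[i] + 0.05 * k + rng.normal(0, 0.1)}
     for i in range(n_images) for k, c in enumerate(conditions)]
)

# Friedman test: one column of scores per condition, aligned by image.
wide = df.pivot(index="image_id", columns="condition", values="score")
stat, p = friedmanchisquare(*[wide[c].to_numpy() for c in conditions])
print(f"Friedman chi-square = {stat:.1f}, p = {p:.2e}")

# Repeated-measures ANOVA on the same long-format data.
print(AnovaRM(df, depvar="score", subject="image_id", within=["condition"]).fit())
```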


Toward this goal, given a single continuous outcome variable, paired data/repeated measures, equal group sizes, and no replications, what could be an appropriate alternative to performing an ANOVA (Kruskal-Wallis, RM ANOVA, or Friedman test) to detect any difference in group means, followed by the appropriate post hoc procedures to determine pairwise which group means differ significantly from each other?

My data consist of >1000 images, each automatically assessed by a computational model for a single quantitative feature under six different conditions (e.g. an automated assessment of % lung emphysema from an X-ray, where in each repeated measurement one of six types of noise is present in the image). Across the six groups, the outcome variable is normally distributed around the median, but not around the mean.

Crudely put, I am looking for a way to illustrate the "robustness" or sustained precision of the measuring instrument under changing conditions, i.e. its ability to produce consistent measurements of a given item under different conditions.

A statistically significant difference in group means does not necessarily imply any practical difference. Given the absence of a ground truth, what approach, besides the one above or simply reporting CIs and pairwise correlation coefficients, could comprehensively illustrate that this set of methods yields measurements that are in strong accordance with each other? Is reporting the average within-subject variability not descriptive enough? Thank you for any input!
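
To make that last question concrete, this is the kind of within-subject-variability summary and pairwise correlation table I have in mind, sketched on the `wide` (images x conditions) table from the simulation above:

```python
# Per-image spread across the six conditions, summarised over all images.
per_image_sd = wide.std(axis=1, ddof=1)
print(f"mean within-image SD = {per_image_sd.mean():.3f}, "
      f"95th percentile = {per_image_sd.quantile(0.95):.3f}")

# Pairwise Pearson correlations between conditions (agreement across images).
print(wide.corr(method="pearson").round(3))
```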

  • Answers and comments for this recent question may be helpful: [stats.stackexchange.com/questions/552497/why-cant-be-hypothesis-testing-done-in-opposite-way](https://stats.stackexchange.com/questions/552497/why-cant-be-hypothesis-testing-done-in-opposite-way) – Sal Mangiafico Dec 02 '21 at 12:10
  • Thank you for your quick comment. Your answer to that question is quite dense for me, and I am struggling to put this approach into the context of my case. What would a confidence distribution look like in this analysis, given the repeated measures and levels? If there is something specific you would like me to understand, relevant to my case, could you give me a clue? Very helpful. – Levi.Steinberg Dec 02 '21 at 17:13
  • Well, I think you've identified the problem: with a large sample size, these hypothesis tests are likely to have enough "signal" versus the "noise" of variability to report that there is a detectable difference among the different conditions. ... One thing to consider is TOST equivalence testing. ... Another idea is to use measures of "accuracy" like MAPE, RMSE, CV, but you'd have to apply judgement as to whether these results are meaningfully high or not. ... Finally, a plot of the paired data, with a 1:1 line imposed may give a good sense of how similar or different the results are. – Sal Mangiafico Dec 02 '21 at 19:03
  • I am looking for a test statistic to present in the end. I have thought about RMSE, and indeed I have trouble interpreting what the magnitude of this statistic means. I have looked at TOST, as it was mentioned in the comments on the linked question. Would you perform pairwise comparisons of groups, 15 pairs in the case of six conditions, and do some kind of p-value adjustment for the number of pairwise comparisons? What about reporting some kind of metric on the average variance at the subject level? – Levi.Steinberg Dec 02 '21 at 20:43
  • Well, I'm not sure what to really suggest.... I think most of my suggestions would have to be done pairwise. In the case of e.g. multiple equivalence tests, I don't think I would apply a p-value correction, but it may make sense to do so. ... The within-subject variance idea may be helpful. ... – Sal Mangiafico Dec 02 '21 at 22:04
  • Thank you for your input so far. Hopefully you can advise me once more. Why would you not perform a p-value correction? What kind would make sense to do? I performed 15 pairwise comparisons with a lower and upper bound of ±0.3. All TOSTs using statsmodels.stats.weightstats.ttost_paired in Python yield p-values smaller than 0.05, with the largest p-value being on the order of 10^-29. Would I just have to set the alpha to 0.05/(15-1) and call it a Bonferroni correction? (A sketch of this pairwise setup is shown after these comments.) – Levi.Steinberg Dec 03 '21 at 12:03
  • I think I wouldn't perform a *p* value correction simply because I would ***not*** want to prioritize finding false differences, in this case, over false equivalence. But if your *p* values are like 10^-29, it's not going to matter anyway. And using a *p* value correction makes the case that you _are_ prioritizing finding differences .... I think Bonferroni correction would be 0.05/15 for 15 *p* values. Now, if you're using two tests for each pairwise comparison, I don't know if that's 30 *p* values or not. ... – Sal Mangiafico Dec 03 '21 at 13:52
  • Many thanks for the insight. – Levi.Steinberg Dec 05 '21 at 10:23
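
For reference, a minimal sketch of the pairwise equivalence testing discussed in the comments above, reusing the `wide` table from the question; the ±0.3 bounds are the ones I used, but whether that is a meaningful equivalence margin is a separate, domain-specific judgement:

```python
import itertools
from statsmodels.stats.weightstats import ttost_paired

# Paired TOST for every pair of the six conditions (15 pairs), with
# equivalence bounds of -0.3 and +0.3 in the units of the outcome.
low, upp = -0.3, 0.3
for a, b in itertools.combinations(wide.columns, 2):
    p, res_low, res_upp = ttost_paired(wide[a].to_numpy(), wide[b].to_numpy(), low, upp)
    print(f"{a} vs {b}: TOST p = {p:.2e}")
```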

0 Answers