
I have calculated the empirical distribution of a certain metric for two different conditions A (blue) and B (yellow). The analytic distributions are unknown. Plotted are the kernel density estimators for each distribution.

I am interested in testing whether A has too many large values under the null hypothesis that A came from the same distribution as B. My naive reasoning is that if I take the threshold for the largest 1% of B (its 99th percentile), then 30% of the values of A fall above that threshold. Given that I have tens of thousands of data points, this is clearly significant, but I don't know how to formalize the test. I have naively attempted a comparison of means (e.g. a t-test), but the mean of B is actually higher than the mean of A, so that is clearly not the test I am looking for.
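
For concreteness, here is a minimal sketch of that naive check, assuming the two samples are held in NumPy arrays a and b (the lognormal draws below are just placeholders for the real data):

```python
import numpy as np

# Placeholder data; in practice a and b hold the observed metric values
# for conditions A and B (tens of thousands of points each).
rng = np.random.default_rng(0)
a = rng.lognormal(mean=0.0, sigma=1.5, size=50_000)
b = rng.lognormal(mean=0.5, sigma=0.5, size=50_000)

# Threshold above which the largest 1% of B lies (B's 99th percentile).
threshold = np.percentile(b, 99)

# Fraction of A above that threshold; under the null hypothesis that
# A and B share a distribution, this should be close to 0.01.
frac_a_above = np.mean(a > threshold)
print(f"Fraction of A above B's 99th percentile: {frac_a_above:.3f}")
```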

[Figure: violin plots of the metric under conditions A (blue) and B (yellow), with the metric values on a logarithmic vertical axis.]

Edit: I'm not 100% sure, but what I might want is a Kolmogorov-Smirnov test without the absolute value: testing the maximal signed difference of the empirical CDFs rather than the maximal absolute difference.
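
To make that idea concrete, here is a sketch of the signed statistic computed directly from the two empirical CDFs (the names a and b are illustrative):

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at the points `x`."""
    sample = np.sort(sample)
    return np.searchsorted(sample, x, side="right") / sample.size

def signed_ks_statistic(a, b):
    """D+ = sup_x [F_B(x) - F_A(x)], the signed (one-sided) KS statistic.

    It is large when A's empirical CDF lies below B's somewhere,
    i.e. when A puts more mass on large values than B does.
    """
    grid = np.concatenate([a, b])  # the ECDFs only change at sample points
    return np.max(ecdf(b, grid) - ecdf(a, grid))
```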

Aleksejs Fomins
  • That plot does not look like those at https://en.wikipedia.org/wiki/Kernel_density_estimation - What is it showing? – Henry Feb 24 '22 at 15:08
  • @Henry It is a violin plot from the Python seaborn library. The violin plot uses KDE internally. The plot is distorted towards lower values because of the log scale and because there is no option to tell the plotting library that the values cannot go below zero. I can attempt to provide a better plot if you think it would help. – Aleksejs Fomins Feb 24 '22 at 16:28
  • So the horizontal width is the kernel density, while the logarithmic vertical axis represents the possible values of the metric, though you have not adjusted the chart for the logarithm, which is why the visual areas are so different. – Henry Feb 24 '22 at 16:43
  • The next question is what you are actually trying to test. The chart is visually suggestive that the two distributions are different (the observed values of A are clearly more dispersed than those of B) and with a large enough sample any decent test would reveal this. If you want a one-sided K-S test, you might want to look at https://stats.stackexchange.com/questions/107668/does-it-make-sense-to-perform-a-one-tailed-kolmogorov-smirnov-test or https://stats.stackexchange.com/questions/43451/whats-the-null-hypothesis-in-a-one-sided-kolmogorov-smirnov-test – Henry Feb 24 '22 at 16:53
  • @Henry Thanks, I had a look at the links before, but the procedure seemed highly non-standard; I will read them in greater detail later. I don't want to test dispersion. The distribution of A has complicated origins: it includes effects weaker than B, as well as effects that come from B, and may or may not include effects stronger than B. I want to test whether B can account for all of the variance above its mean. It is not supposed to account for the variance below its mean; it is not designed for that. I am sorry, I am aware this is not very precise. I need time to figure this out. – Aleksejs Fomins Feb 24 '22 at 17:16
  • Ok, I have convinced myself that a one-sided two-sample KS test is exactly what I am looking for (a sketch of how it could be run is included after these comments). Thanks for your help. – Aleksejs Fomins Feb 24 '22 at 18:49
  • @AleksejsFomins If you have an explanation for why, you might be interested in posting (perhaps accepting) a self-answer, so that we can consider this question wrapped up. – Dave Feb 24 '22 at 18:56
  • Yes, I would like to do that. I am exceedingly busy right now, submitting my thesis for review in a few days; I just wanted to check something last minute. I will write a proper answer, maybe in a few days once that is over. – Aleksejs Fomins Feb 24 '22 at 22:46
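
A minimal sketch of how such a one-sided two-sample KS test could be run with SciPy's ks_2samp (the mapping between alternative='less'/'greater' and which tail is heavier is easy to get backwards, so it is worth double-checking against the SciPy documentation; the samples below are placeholders):

```python
import numpy as np
from scipy import stats

# Placeholder samples for conditions A and B.
rng = np.random.default_rng(0)
a = rng.lognormal(mean=0.0, sigma=1.5, size=50_000)
b = rng.lognormal(mean=0.5, sigma=0.5, size=50_000)

# With data1=a and data2=b, alternative='less' takes the alternative
# hypothesis to be that A's CDF lies below B's CDF somewhere, i.e.
# that A puts more mass on large values than B does.
result = stats.ks_2samp(a, b, alternative='less')
print(result.statistic, result.pvalue)
```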
