Scenario Medicine
Let's say I'm developing a new medicine. I have two groups, M0 (placebo) and M1 (real drug), where 1 means "healing observed" and 0 means "no healing observed". I run my experiment and get results which might look like this:
M0 = [ 0, 0, 0, 1, 0, 1, 0, 0, 1, ... ]
M1 = [ 1, 0, 0, 1, 1, 1, 1, 0, 1, ... ]
I would now perform a Kruskal-Wallis test KW(M0, M1), resulting in a p-value, say PM = 0.04. The interpretation of PM is, sort of:

> If we assumed M0 and M1 were based on the same distribution, only in 4 out of 100 cases would we find two samples looking as distinct as M0 and M1 do.
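To make this concrete, here is a minimal sketch of how I compute PM, assuming Python with numpy/scipy; the 0/1 arrays are hypothetical stand-ins generated for illustration, not my actual data:

```python
# Minimal sketch, assuming Python with numpy/scipy; the 0/1 arrays are
# hypothetical stand-ins generated for illustration, NOT my actual data.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
M0 = rng.binomial(1, 0.3, size=50)  # placebo: assumed ~30% healing rate
M1 = rng.binomial(1, 0.6, size=50)  # drug: assumed ~60% healing rate

H, PM = kruskal(M0, M1)
print(f"H = {H:.3f}, PM = {PM:.4f}")
```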
Practically,

- a low PM is "good", and
- a low PM can be used to guide a "bet" on the medicine, e.g., whether a company should invest, or whether a patient should be treated, if the costs and benefits of being right or wrong about that bet are known,
- companies can pre-compute a PM cutoff value below which they declare "certain enough" and commit to production (see the toy sketch after this list).
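The cutoff rule from the last bullet point is just a threshold comparison; all numbers below are made up:

```python
# Toy sketch of the pre-computed cutoff rule; ALPHA and PM are made-up numbers.
ALPHA = 0.05  # pre-computed, company-wide "certain enough" cutoff
PM = 0.04     # p-value obtained from the experiment

commit_to_production = PM < ALPHA  # True here, so: commit
print("commit to production:", commit_to_production)
```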
Scenario Disc Brake
Let's instead say I'm developing a cheaper replacement disc brake for a car. I again form two groups, my replacement disc DR and the original one DO, and run an experiment where I measure whether my brakes break (0) or persist (1) under certain environmental conditions:
DR = [ 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, ...]
DO = [ 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...]
If I now perform a Kruskal-Wallis test KW(DR, DO), I will get another p-value, say PD = 0.82.
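Analogously, here is a sketch for PD, again assuming Python with numpy/scipy; the persist rates are assumptions for illustration, not measured values. If I understand correctly, with exactly two groups Kruskal-Wallis should agree with a two-sided Mann-Whitney U test (asymptotic, without continuity correction), so I print both:

```python
# Sketch for PD, assuming Python with numpy/scipy; persist rates are
# assumptions for illustration, not measured values.
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(1)
DR = rng.binomial(1, 0.85, size=40)  # replacement disc: assumed persist rate
DO = rng.binomial(1, 0.90, size=40)  # original disc: assumed persist rate

print("PD via Kruskal-Wallis:", kruskal(DR, DO).pvalue)
# With exactly two groups, KW should match the asymptotic two-sided
# Mann-Whitney U test without continuity correction:
print("PD via Mann-Whitney U:",
      mannwhitneyu(DR, DO, alternative="two-sided",
                   use_continuity=False, method="asymptotic").pvalue)
```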
Here things become unclear to me, both how to interpret this value practically, and whether there is a better test available. In particular:
- PD seems to indicate that only in 82 out of 100 cases (under the assumption that my replacement disc is identical to the original) would I see such an outcome. Does that mean I should not bet on my replacement disc if I wanted a 5% significance level?
- In other words, would I need PD >= 0.95 to reach, semantically, the same level of confidence (relative to the medicine case) that my disc types are actually identical?
- From practical experiments it seems to be much harder (a larger N is needed) to reach such high p-values in "replacement disc"-type experiments (see the simulation sketch after this list). Intuitively I sort of get that testing for the absence of differences is harder than showing that differences exist, but I am wondering whether there are better tests I could use in "disc brake"-type studies than in "medicine"-type studies?
- Is there something else I'm missing here?
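To illustrate the third point empirically, here is a simulation sketch (rates and sample sizes are made up): even when DR and DO are drawn from the same distribution, a run often fails to produce a large p-value, so demanding PD >= 0.95 seems to be a very different thing from demanding PM < 0.05:

```python
# Simulation sketch: distribution of KW p-values when both disc types
# are truly identical. Rates and sample sizes are made up.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(2)
p_values = []
for _ in range(2000):
    DR = rng.binomial(1, 0.9, size=30)  # same persist rate for both groups
    DO = rng.binomial(1, 0.9, size=30)
    if DR.all() and DO.all():
        continue  # kruskal is undefined if every observation is identical
    p_values.append(kruskal(DR, DO).pvalue)

p_values = np.array(p_values)
print("fraction of runs with p >= 0.95:", (p_values >= 0.95).mean())
print("fraction of runs with p >= 0.50:", (p_values >= 0.50).mean())
```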
[Edit - Background]: I'm writing a paper and am interested in a mixture of both scenarios. I have a system that, on theoretical grounds, should behave differently from a reference for some parameters and identically to it for others. What tests would ideally be needed to reasonably convince myself and my readers that the systems are the same when used in some configurations, but different in other configurations?