I am trying to show that the difference in performance between two classifiers is statistically significant.
My proposed method changes the output for only certain cases; for the rest, its output is identical to the baseline's. The AUC improves, but when I compare the two classifiers with DeLong's test and with bootstrapping, the p-value I get is very high (my guess is that this is because the classifier output is the same for most of the cases).
Is there a method that takes into account that the new classifier modifies only a few cases?
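
For concreteness, the bootstrap comparison mentioned above can be sketched roughly as follows. This is only an illustration, not the exact code used: the function names `auc` and `paired_bootstrap_auc_diff`, the stratified resampling, and the 95% percentile interval are all assumptions made for the example.

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the Mann-Whitney statistic; tied scores count as 0.5."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def paired_bootstrap_auc_diff(y_true, scores_base, scores_new,
                              n_boot=2000, seed=0):
    """Observed AUC difference (new - baseline) and a 95% percentile CI.

    Hypothetical helper: both classifiers are scored on the SAME
    resampled cases in every replicate, which keeps the comparison paired.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores_base = np.asarray(scores_base, dtype=float)
    scores_new = np.asarray(scores_new, dtype=float)

    pos_idx = np.flatnonzero(y_true == 1)
    neg_idx = np.flatnonzero(y_true == 0)
    observed = auc(y_true, scores_new) - auc(y_true, scores_base)

    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Stratified resampling (an assumption): draw positives and
        # negatives separately so each replicate contains both classes.
        idx = np.concatenate([
            rng.choice(pos_idx, size=pos_idx.size, replace=True),
            rng.choice(neg_idx, size=neg_idx.size, replace=True),
        ])
        diffs[b] = (auc(y_true[idx], scores_new[idx])
                    - auc(y_true[idx], scores_base[idx]))

    ci = np.percentile(diffs, [2.5, 97.5])
    return observed, ci
```

Because both classifiers are evaluated on the same resampled cases, the interval describes the paired AUC difference rather than two independent AUC estimates. If the interval excludes 0, that suggests a significant difference at roughly the 5% level.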
Any help is appreciated. Thanks!