Context
I'm comparing 7 classification algorithms using corrected resampled t-tests on 3-times-repeated 10-fold CV. I have educated guesses as to how their performance will line up. (For example, a transductive semi-supervised algorithm will probably outperform an inductive supervised one that can use only the small labeled part of the same training data, etc.)
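For reference, by "corrected resampled t-test" I mean the Nadeau and Bengio (2003) variance correction for resampled/repeated CV. A minimal sketch of the statistic as I compute it (the function name and interface are my own, not from any library):

```python
import numpy as np
from scipy import stats

def corrected_resampled_ttest(diffs, test_train_ratio=1/9):
    """Corrected resampled t-test (Nadeau & Bengio, 2003).

    diffs: per-fold score differences between two algorithms
           (3 x 10-fold CV gives 30 values).
    test_train_ratio: n_test / n_train, i.e. 1/9 for 10-fold CV.
    Returns the t statistic and the two-sided p-value.
    """
    diffs = np.asarray(diffs, dtype=float)
    n = diffs.size
    # Inflate the variance to account for overlapping training sets
    # across CV folds, which makes the naive paired t-test too liberal.
    t = diffs.mean() / np.sqrt((1/n + test_train_ratio) * diffs.var(ddof=1))
    p_two_sided = 2 * stats.t.sf(abs(t), df=n - 1)
    return t, p_two_sided
```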
Problem
Before looking at any classification performance data, I created a ranking of expected performance and wrote down my reasons for believing in it. Based on this ranking, I could announce only 6 hypotheses:
H1: algorithm A performs better than algorithm B
H2: algorithm B performs better than algorithm C
...
H6: algorithm F performs better than algorithm G
Alternatively, I could perform all 7(7-1)/2 = 21 pairwise comparisons, irrespective of my intuitions. Since I need a multiple-comparison correction such as Bonferroni (or the more powerful Holm step-down procedure), it would be advantageous to keep the number of hypotheses to a minimum.
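To make the stakes concrete, here is a quick sketch (nothing algorithm-specific, just the textbook thresholds at alpha = 0.05) of what 6 versus 21 hypotheses costs per test:

```python
alpha = 0.05
for m in (6, 21):
    # Bonferroni: every p-value must pass alpha / m.
    # Holm: the i-th smallest p-value is compared to alpha / (m - i + 1),
    # so only the smallest p-value faces the full Bonferroni threshold.
    holm_steps = [alpha / (m - i) for i in range(m)]
    print(f"m={m:2d}: Bonferroni={alpha/m:.4f}, "
          f"Holm steps {holm_steps[0]:.4f} ... {holm_steps[-1]:.4f}")
```

With 6 hypotheses the strictest threshold is 0.05/6 ≈ 0.0083 rather than 0.05/21 ≈ 0.0024.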
Questions
Is it legitimate to reduce the number of pairwise comparisons based on such a priori expectations?
Do I see this correctly that H1 and H2 combined imply that algorithm A performs better than algorithm C, without requiring a dedicated hypothesis to test this? If the null hypotheses of both H1 and H2 are rejected, the answer is obviously yes, but what if this is not the case?
Also, I would use two-sided tests despite my directional intuitions. That way, nobody can object that I might have chosen the direction of a one-sided test after seeing the data. Doesn't this conservative choice contradict the other decision, namely to reduce the number of hypotheses?
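For what it's worth, the cost of going two-sided is easy to quantify: when the observed effect lies in the predicted direction, the two-sided p-value is exactly twice the one-sided one, whereas dropping from 21 to 6 hypotheses relaxes the Bonferroni threshold by a factor of 3.5. A toy illustration (the t value is made up):

```python
from scipy import stats

t, df = 2.2, 29          # hypothetical statistic from a 3 x 10-fold comparison
p_one = stats.t.sf(t, df)            # one-sided, effect in predicted direction
p_two = 2 * stats.t.sf(abs(t), df)   # two-sided; exactly twice p_one
print(f"one-sided p={p_one:.4f}, two-sided p={p_two:.4f}")
```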