I'm working on a classification problem and I have a strong baseline with an F1 score of 85%. I have trained three classification models and I want to know which one is the best. How can I do so?
I tried two approaches:
1. Compare each model against the baseline using a paired t-test. So I have tests like:
baseline vs. model 1 | baseline vs. model 2 | baseline vs. model 3
This tells me that only model 1 is significantly higher than the baseline, so I concluded that model 1 is the best. Is this a valid methodology, given that classification models are usually compared against baselines?
2. Compare all models at once with a one-way ANOVA. I entered the results of models 1-3 and the baseline, which gave me a p-value of 0.02, indicating that at least one mean differs. Yet in the post-hoc pairwise tests, none of the pairs differ significantly.
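For concreteness, the two procedures might look roughly like the sketch below. It assumes each system has one F1 score per cross-validation fold, which the description above does not state. The scores are made-up placeholders, not real results, and the names `scores`, `paired_vs_baseline`, and `anova_with_posthoc` are purely illustrative. The post-hoc step shown is pairwise unpaired t-tests with a Bonferroni correction, which is only one common choice and may not be the post-hoc test that was actually used.

```python
from itertools import combinations

from scipy import stats

# Hypothetical per-fold F1 scores (placeholders, NOT real results).
scores = {
    "baseline": [0.85, 0.84, 0.86, 0.85, 0.85],
    "model 1": [0.88, 0.87, 0.89, 0.87, 0.88],
    "model 2": [0.86, 0.84, 0.85, 0.86, 0.85],
    "model 3": [0.85, 0.86, 0.84, 0.86, 0.86],
}


def paired_vs_baseline(scores, baseline="baseline"):
    """Approach 1: paired t-test of each model against the baseline.

    Returns a dict mapping model name to its two-sided p-value.
    """
    return {
        name: stats.ttest_rel(vals, scores[baseline]).pvalue
        for name, vals in scores.items()
        if name != baseline
    }


def anova_with_posthoc(scores):
    """Approach 2: one-way ANOVA over all groups, then pairwise t-tests.

    The pairwise p-values are Bonferroni-corrected (multiplied by the
    number of pairs and capped at 1.0). This correction is an assumption
    made for the sketch.
    """
    anova_p = stats.f_oneway(*scores.values()).pvalue
    pairs = list(combinations(scores, 2))
    posthoc = {
        (a, b): min(1.0, stats.ttest_ind(scores[a], scores[b]).pvalue * len(pairs))
        for a, b in pairs
    }
    return anova_p, posthoc


if __name__ == "__main__":
    print("Paired t-tests vs. baseline:", paired_vs_baseline(scores))
    anova_p, posthoc = anova_with_posthoc(scores)
    print("ANOVA p-value:", anova_p)
    for pair, p in posthoc.items():
        print(pair, p)
```

Note that `ttest_rel` pairs the scores fold by fold, while `f_oneway` and `ttest_ind` treat each system's scores as independent samples, so the two approaches do not rest on the same assumptions.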
Which method is the correct one?