
I'm working on a classification problem with a very high F1 baseline of 85%. I have trained three classification models and I want to know which one is the best. How can I do so?

I tried two ways:

  1. Compare each model against the baseline using a paired t-test. So I have tests like:

    baseline vs. model 1 | baseline vs. model 2  | baseline vs. model 3
    

    That tells me that only model 1 is significantly higher than the baseline, so I concluded that model 1 is the best. Is this a valid methodology, given that classification models are usually compared against baselines?

  2. Compare all models in one fell swoop with a one-way ANOVA. I entered the results of models 1-3 and the baseline, which gave me a p-value of 0.02, indicating that there is a difference in means. Yet, with a post hoc pairwise test, there is no significant difference between any of the pairs.

Which method is the correct one?
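To make the two approaches concrete, here is a minimal sketch of both using SciPy. The per-fold F1 scores are made-up illustrative numbers (assuming you evaluated each model with the same cross-validation folds, which is what makes the t-tests paired):

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold F1 scores from 10-fold cross-validation
# (made-up numbers for illustration only).
rng = np.random.default_rng(0)
baseline = rng.normal(0.85, 0.02, size=10)
model1 = rng.normal(0.88, 0.02, size=10)
model2 = rng.normal(0.86, 0.02, size=10)
model3 = rng.normal(0.85, 0.02, size=10)

# Approach 1: paired t-test of each model against the baseline,
# with a Bonferroni correction for running three comparisons.
alpha = 0.05
models = {"model 1": model1, "model 2": model2, "model 3": model3}
for name, scores in models.items():
    t, p = stats.ttest_rel(scores, baseline)
    print(f"{name} vs baseline: p = {p:.4f}, "
          f"significant at corrected alpha: {p < alpha / len(models)}")

# Approach 2: one-way ANOVA over all four groups at once.
f, p_anova = stats.f_oneway(baseline, model1, model2, model3)
print(f"ANOVA: F = {f:.2f}, p = {p_anova:.4f}")
```

Note that `f_oneway` treats the groups as independent; since the scores come from the same folds, a repeated-measures design would technically be more appropriate, but the sketch shows the mechanics of each approach.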

  • If you're comparing the performance of various classifiers, wouldn't it be better to use something like the receiver operating characteristic (ROC) curve? – chl Jun 11 '13 at 06:11
  • But not all my classifiers are binary. – Sabba Jun 11 '13 at 15:23

1 Answer


I believe both approaches are valid.

When you run simple pairwise t-tests, you are more sensitive to differences because you are not accounting for the other tests you will run.

On the other hand, post hoc tests are more conservative, but they can sometimes be a bit controversial.

Suppose you run a post hoc test among 10 groups and find no difference (at some confidence level). Weeks later, you decide to run the same test, but with only 5 of the groups, and now you find a difference. What can be inferred from this experiment, if nothing else changed?
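One concrete way to see why this can happen: the number of pairwise comparisons grows quadratically with the number of groups, so any familywise correction tightens the per-test threshold as groups are added. A small sketch using Bonferroni as the simplest stand-in for whatever post hoc correction is applied:

```python
from math import comb

alpha = 0.05
for k in (10, 5):
    n_pairs = comb(k, 2)          # number of pairwise comparisons
    threshold = alpha / n_pairs   # Bonferroni-corrected per-test alpha
    print(f"{k} groups: {n_pairs} pairs, per-test alpha = {threshold:.5f}")

# A raw pairwise p-value of 0.003 clears the 5-group threshold
# (0.05 / 10 = 0.005) but not the 10-group one (0.05 / 45 ~ 0.00111),
# so the same pair can flip from "not significant" to "significant"
# just by dropping groups from the analysis.
```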

When you have strong evidence of a difference, both approaches should agree and you are done. When you have small fluctuations and apparent divergences between the approaches, you should consider other signals, such as which model fits your hypotheses better, which one is more robust to noise, etc.

tlewin