When performing multiple hypothesis tests, for example in stepwise model selection, we need to apply something like the Bonferroni correction to the alpha (significance) level in order to avoid too many false positives. However, criteria such as AIC and BIC are also used for model selection, yet they have no significance level to correct.
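For concreteness, the Bonferroni correction referred to above simply divides the family-wise significance level by the number of tests (the numbers here are only an illustration):

```python
# Bonferroni correction: to keep the family-wise error rate at alpha
# across m hypothesis tests, run each individual test at alpha / m.
alpha = 0.05
m = 10                       # hypothetical number of tests performed
alpha_adjusted = alpha / m   # each test uses this stricter threshold

print(alpha_adjusted)
```

It is exactly this per-test adjustment that has no obvious analogue for AIC or BIC.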

Is it necessary to apply a correction to these metrics? It certainly seems so, since we are likely to find more false positives simply because of the number of comparisons we are performing.

If yes, what should it be?

Migwell
    Not an answer, but criteria like AIC may implicitly have a significance level, see e.g. https://stats.stackexchange.com/questions/275861/aic-and-bic-criterion-for-model-selection-how-is-it-used-in-this-paper/275948#275948. – Christoph Hanck Feb 23 '21 at 09:40

1 Answer


If you compare multiple model fits based on some criterion and then try to compare pairs of models, you have the same pairwise-comparisons problem that Tukey's method solves in the one-way ANOVA. A major wrinkle here is that the statistics you are comparing are highly correlated, since they are based on nearly the same models.

This problem has been solved in the case where the model criterion is correct classification probability, in the following paper:

Westfall, P.H., Troendle, J.F. and Pennello, G. (2010). Multiple McNemar Tests. Biometrics 66, 1185–1191.

In particular, the strong correlation structure between the models that are nearly the same is "harvested" to improve the power of the multiple comparisons procedure.

You could extend that method to apply to likelihood-based tests, but it is new research as far as I know. You need to write the fit statistic as a mean of iid observation-level quantities. Then you can use a multivariate bootstrap to construct multiple comparisons between the means. The bootstrap provides a useful and valid way to incorporate the extremely high dependencies between different models. For the general multivariate idea, see
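The recipe above can be sketched in code. This is only an illustration under stated assumptions, not an implementation of the cited papers: assume each fitted model yields per-observation log-likelihood contributions (so the fit statistic is a mean of iid observation-level quantities), and the `loglik` matrix here is simulated placeholder data rather than output from real fits. Rows are resampled jointly, which is what preserves the strong dependence between models, and a max-|t| bootstrap distribution gives a simultaneous critical value for all pairwise comparisons:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n observations, m candidate models.
# loglik[i, j] = per-observation log-likelihood contribution of
# observation i under model j (simulated here; in practice these
# would come from your fitted models).
n, m = 200, 3
base = rng.normal(size=(n, 1))
loglik = base + 0.05 * rng.normal(size=(n, m))  # highly correlated columns

pairs = [(j, k) for j in range(m) for k in range(j + 1, m)]

def pair_t_stats(ll):
    """t statistics for mean per-observation log-likelihood differences."""
    stats = []
    for j, k in pairs:
        d = ll[:, j] - ll[:, k]
        se = d.std(ddof=1) / np.sqrt(len(d))
        stats.append(d.mean() / se if se > 0 else 0.0)
    return np.array(stats)

t_obs = pair_t_stats(loglik)

# Multivariate bootstrap: resampling whole rows keeps the dependence
# between the models' statistics intact. Columns are centred first so
# the bootstrap distribution mimics the null of no mean difference.
B = 2000
centered = loglik - loglik.mean(axis=0)
max_abs_t = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    max_abs_t[b] = np.abs(pair_t_stats(centered[idx])).max()

# Simultaneous 5% critical value over all pairwise comparisons
crit = np.quantile(max_abs_t, 0.95)
significant = np.abs(t_obs) > crit
print(t_obs, crit, significant)
```

Because the critical value comes from the joint (max-|t|) bootstrap distribution rather than a Bonferroni bound, the high correlation between nearly identical models is "harvested" to give a less conservative procedure.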

Westfall, P.H. (2011). On Using the Bootstrap for Multiple Comparisons. Journal of Biopharmaceutical Statistics 21, 1187–1205.

Good luck!

BigBendRegion