
It's pretty well acknowledged that error control via p-values fails when models are selected based on the data rather than decided on a priori. I've always viewed this as an issue of marginal vs conditional distributions, such that: $$P(error) = P(error | M = M_1)P(M = M_1) + P(error | M = M_2)P(M = M_2) \neq P(error | M=m)$$ where $M$ refers to the model to be selected, $M_i$ are the models, and $m$ is a realized value of $M$.

However, 'robust' p-values still maintain error control under incorrect models, provided that the focus of inference remains the same under each model¹. Hence, since our p-values control the error rate under each model², the maximum error rate is still at or below whatever alpha we choose. Am I missing anything here?
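Spelled out, the bound I have in mind is (with $\alpha$ the chosen level): $$P(error) = \sum_i P(error \mid M = M_i)\,P(M = M_i) \le \alpha \sum_i P(M = M_i) = \alpha.$$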

¹ This does not hold in general; it requires the expected value of the relevant parameter to be the same under each model

² It would seemingly still affect power calculations, though

2 Answers


I think you are missing something, but it depends on your definition of robustness.

I would have defined robustness as saying that for any fixed $M_i$, $P(error|M=M_i)$ is the same (or is controlled), regardless of whether $M_i$ is correctly specified.

What you need to avoid post-selection problems is that $P(error|M=m)$ is correctly specified for random $m$. This requires not only that $P(error|M=M_i)$ is the same for all $M_i$, but also that the event $[error | M=M_i]$ is independent of the event $[M=M_i]$. You can come up with settings where this is true, but it's a much stronger condition than just having correct errors under misspecification.
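To make the role of that independence explicit (a sketch in notation I am introducing here: $A_i$ is the event that the test built for model $M_i$ commits an error, and $S_i = [M=M_i]$ is the event that $M_i$ is selected), the post-selection error probability is $$\sum_i P(A_i \cap S_i) = \sum_i P(A_i \mid S_i)\,P(S_i).$$ Robustness only bounds the unconditional $P(A_i) \le \alpha$; the sum is bounded by $\alpha$ if, in addition, each $A_i$ is independent of $S_i$, so that $P(A_i \mid S_i) = P(A_i)$. Otherwise $P(A_i \mid S_i)$ can be much larger than $\alpha$, because the selection step looks at the same data as the test.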

You can often still control the post-selection error probability -- for example, if the null hypothesis is one that can be tested by permutation, do a permutation test over the entire model-selection procedure. But that's stronger than robustness -- the reference distribution for the test depends on the entire set of candidate models.
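A minimal sketch of that idea in Python (the two-stage t-test/Wilcoxon rule below is a hypothetical selection procedure of my own choosing; the point is only that the whole selection-plus-test procedure is rerun inside the permutation loop, so the reference distribution accounts for the selection step):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def adaptive_pvalue(x, y, pretest_alpha=0.05):
    """Hypothetical two-stage procedure: use a t-test if a Shapiro-Wilk
    normality check passes in both groups, otherwise a Wilcoxon rank-sum test."""
    normal_ok = (stats.shapiro(x).pvalue > pretest_alpha and
                 stats.shapiro(y).pvalue > pretest_alpha)
    if normal_ok:
        return stats.ttest_ind(x, y).pvalue
    return stats.mannwhitneyu(x, y).pvalue

def permutation_test_over_selection(x, y, n_perm=2000):
    """Permutation test whose reference distribution is built by rerunning
    the *entire* adaptive procedure (pre-test + test) on each relabelling,
    using the adaptive p-value itself as the test statistic."""
    observed = adaptive_pvalue(x, y)
    pooled = np.concatenate([x, y])
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        # smaller adaptive p-values count as more extreme
        if adaptive_pvalue(perm[:n_x], perm[n_x:]) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Example usage with two samples under the null of exchangeability
x = rng.normal(size=30)
y = rng.normal(size=30)
print(permutation_test_over_selection(x, y))
```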

In at least one post-selection inference problem there is a proof that no confidence interval procedure can have correct coverage (even asymptotically).

Thomas Lumley

The answer by Thomas Lumley provides a good explanation of why there is a problem. To put it another way: even if you are fine no matter which model you pick before seeing the data, you are not necessarily fine if you pick the model based on the data, because each model is only fine on average across all possible data realizations (and different models may go wrong for different data realizations).

There are some more examples, e.g.:

  • Switching between the t-test and the Wilcoxon rank-sum test based on whether a test for normal residuals is significant is known to inflate the type I error (a simulation sketch follows this list). This happens even when the residuals are perfectly normal, so that either test would be valid.
  • Changing the analysis (random effects model assuming no carry-over or analysis of only the first period) in a cross-over clinical trial depending on whether a test for carry-over is significant is known to inflate the type I error (even if there is in truth no carry-over so that both analyses are valid).
  • Switching between a fixed effects and a random effects meta-analysis model depending on a test for heterogeneity of effects is known to inflate the type I error, even when both analyses would be valid.
  • Doing linearizing transformations based on a test of linearity (e.g. testing the significance of a quadratic term in a model) inflates the type I error.
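As a rough illustration of the first bullet (a simulation sketch of mine, not a reproduction of any published study; the pre-test rule, sample size, and sampling distribution are arbitrary choices), the snippet below estimates the marginal type I error of the two-stage t-test/Wilcoxon procedure together with the error rates conditional on which test was selected:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def adaptive_test(x, y, alpha=0.05):
    """Two-stage procedure from the first bullet: pre-test normality,
    then run either the t-test or the Wilcoxon rank-sum test."""
    chose_t = (stats.shapiro(x).pvalue > alpha and
               stats.shapiro(y).pvalue > alpha)
    p = stats.ttest_ind(x, y).pvalue if chose_t else stats.mannwhitneyu(x, y).pvalue
    return p < alpha, chose_t

def type1_rates(sampler, n=20, n_sim=20000, alpha=0.05):
    """Estimate the marginal type I error of the two-stage procedure and the
    error rates conditional on which test was selected (both groups are drawn
    from the same distribution, so every rejection is a type I error)."""
    reject = np.empty(n_sim, dtype=bool)
    used_t = np.empty(n_sim, dtype=bool)
    for i in range(n_sim):
        reject[i], used_t[i] = adaptive_test(sampler(n), sampler(n), alpha)
    return {
        "marginal": reject.mean(),
        "given t-test selected": reject[used_t].mean() if used_t.any() else np.nan,
        "given Wilcoxon selected": reject[~used_t].mean() if (~used_t).any() else np.nan,
    }

# Normal null case; swap in e.g. rng.exponential to see a skewed null
print(type1_rates(lambda n: rng.normal(size=n)))
```

Comparing the selection-conditional rates with the nominal level makes the marginal-vs-conditional point from the question concrete; how large the discrepancy is depends on the sample size and the true distribution.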

There's probably a huge number of further examples one can find.

This is all very consistent with the statistical literature, which has long pointed out that doing post-model-selection inference as if no model selection had occurred leads to problematic properties of the inference.

Björn