
A statistical estimator can be characterized, among other things, by its bias and its variance. These are different aspects, but if one is willing to accept a certain loss function (risk), they can be merged; the most popular example is the MSE (which can be shown to be the variance plus the square of the bias). If one is willing to accept MSE as a metric, it can be used to objectively compare estimators even if they have different biases.
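(For reference, a quick sketch of that decomposition, writing $\hat\theta$ for the estimator and $\theta$ for the true value:

$$\operatorname{MSE}(\hat\theta)=E\big[(\hat\theta-\theta)^2\big]=\operatorname{Var}(\hat\theta)+\big(E[\hat\theta]-\theta\big)^2=\operatorname{Var}(\hat\theta)+\operatorname{Bias}(\hat\theta)^2.)$$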

My question is: do we have analogous concepts for statistical tests?

Motivation: We all know that, despite being commonly recommended, even by textbooks, it is a bad idea to choose which test to apply based on another test performed on the same sample. (Let's say, for concreteness: deciding whether to use the $t$-test or the Welch test based on an $F$-test or a Levene test.) Doing so results in an invalid test, i.e. the distribution of the $p$-values under the null will no longer be uniform and the Type I error rate will differ from the nominal significance level. However, one might say: "OK, I understand that there is some invalidity here, but, hey, we will have higher power!" (The $t$-test has higher power than the Welch test; that is, of course, the very reason why this flawed strategy is employed at all.)
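To make the setup concrete, here is a minimal simulation sketch of that "preliminary testing" strategy (the sample sizes, variances, significance level, and replication count below are arbitrary illustrative choices, not taken from any particular reference):

```python
# Sketch: choose pooled t-test vs. Welch based on a preliminary F-test for
# equal variances, and check the Type I error rate of the combined procedure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n1, n2 = 10, 30          # unbalanced groups (illustrative choice)
sd1, sd2 = 2.0, 1.0      # unequal variances; the null on the means is true
reps = 20_000
rejections = 0

for _ in range(reps):
    x = rng.normal(0.0, sd1, n1)
    y = rng.normal(0.0, sd2, n2)

    # Preliminary two-sided F-test for equality of variances
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    p_f = 2 * min(stats.f.cdf(f, n1 - 1, n2 - 1),
                  stats.f.sf(f, n1 - 1, n2 - 1))

    # Main test chosen by the preliminary one:
    # pooled t-test if the F-test does not reject, Welch test otherwise
    p = stats.ttest_ind(x, y, equal_var=(p_f >= alpha)).pvalue
    rejections += p < alpha

# Compare with the nominal alpha of 0.05
print("empirical Type I error rate:", rejections / reps)
```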

Now, my feeling is that this invalidity of a statistical test is analogous to the bias of an estimator (it is actually called bias in mathematical statistics texts, if I understand the concept correctly), and I have the feeling that power is analogous to variance.

So, question #1: are my feelings correct? Do we really have an equivalence here?

Question #2: if so, do we have an analog of MSE? (I accept that we have to select a loss function (risk) for this, of course.)
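(To make "loss function (risk)" concrete, one standard decision-theoretic framing, under $0$–$1$ loss, would assign to a test $\varphi$ the risk

$$R(\theta,\varphi)=\begin{cases}P_\theta(\text{reject }H_0) & \theta\in\Theta_0,\\ 1-\operatorname{power}(\theta) & \theta\in\Theta_1,\end{cases}$$

i.e. the Type I error rate on the null and one minus the power on the alternative; what I am asking is whether these two pieces can be merged into a single MSE-like number.)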

This would be interesting because it would allow us to objectively decide whether there is any merit in the above reasoning (i.e. higher power offsetting the loss of validity – just as a biased estimator can have smaller MSE than an unbiased one).

  • Why isn't this just an issue of p-value adjustment for multiple testing? – Michael R. Chernick Mar 29 '18 at 02:13
  • 1
    @MichaelChernick This "preliminary testing" strategy implies a totally different mechanism (one that is much harder to track analytically, actually) for the change in Type I error rate than multiple comparisons. See [this](https://www.ncbi.nlm.nih.gov/pubmed/15171807) article,but it has been also discussed here, see [this](https://stats.stackexchange.com/questions/289449/is-variance-homogeneity-check-necessary-before-t-test) or [this](https://stats.stackexchange.com/questions/61715/choosing-a-statistical-test-based-on-the-outcome-of-another-e-g-normality/61716) in addition to the one I linked. – Tamas Ferenci Mar 29 '18 at 04:52
  • 1
    General comment: I don't think it's good to use any method that just moves type II error into type I error, i.e., does not work as advertised with respect to type I error. – Frank Harrell Apr 04 '18 at 13:15
  • 1
    @FrankHarrell I personally agree with you. Nevertheless, I also see some point in the reasoning "OK, the actual alpha is 5.01% instead of the nominal 5% when the null is true, but hey, when its not true, we have 20% more power!". (Of course I made up the numbers.) How should I respond to this? Instead of the "I don't think it's good" and "I personally agree" kind of approaches, an objective, quantitative answer would be of course better. And MSE achieves just this very aim (for an estimator), so that's why I was wondering whether it is possibly to do something like that for a test! – Tamas Ferenci Apr 04 '18 at 13:31
  • 1
    In the two examples I've seen analyzed in detail the gain in power was exactly equal to the gain in type I error. So unless the procedure is truly shown to only increase alpha to 0.051 when you think it's 0.05 I remain skeptical. – Frank Harrell Apr 04 '18 at 20:02
  • @FrankHarrell Ah! I see what you mean! That's exciting; I'll have an empirical (simulated) look at this issue. In the meantime, just two quick comments. (1) Even if you're right, i.e. the error is just redistributed, exactly, I still feel my question is open. Even in that case one might say that a 1% gain in power is "worth" more to him/her than a 1% loss in validity (increase of the actual Type I error rate compared to the nominal alpha). Yes, such a preference would be questionable (insane) indeed, but I'm now trying to make a theoretical point... and one _could_ say he/she has such a preference. – Tamas Ferenci Apr 04 '18 at 21:57
  • @FrankHarrell So I feel this (loss function) aspect still needs to be addressed. (2) My other problem is that while loss in validity is really a number, the gain in power is not a number, rather a function (as it depends on the true value of course). So, you can't really speak about "gain in power" in reality (as a number). How to compare the loss that can be represented by a single number to a gain that is given by a function...? – Tamas Ferenci Apr 04 '18 at 21:59
  • Not sure about the last question, but frequentists tend to put more emphasis on type I error than on type II error. You never see a sample size calculation use a power of 0.95 when alpha=0.05. – Frank Harrell Apr 05 '18 at 12:58
  • @FrankHarrell Yes-yes, I see, good point indeed. The question is whether there is any way to make this "tend to put more emphasis" point quantitative (just as loss function (risk) or MSE makes this "emphasis putting" quantitative - for an estimator). – Tamas Ferenci Apr 05 '18 at 13:58
  • Personally I'd just use Bayesian posterior inference. – Frank Harrell Apr 06 '18 at 01:36
  • @FrankHarrell Yes, of course, it's a way out - but we still have to say something to the 90% who stick to the frequentist framework... And even apart from this, it is a well-defined problem mathematically (I believe!), so it'd still be interesting to know what the answer is... – Tamas Ferenci Apr 06 '18 at 09:31
  • The problem needs to be completely reconceptualized. Write a model that has a parameter for everything you don't know (e.g., ratio of variances for the two groups and a non-normality parameter). Put prior distributions that favor normality and equal variance but allow assumptions to be progressively relaxed as N gets larger. Use standard posterior inference. The only way to really honestly solve the problem and get exact inference. Box and Tiao work this approach out in their book. – Frank Harrell Apr 06 '18 at 12:48
  • @FrankHarrell Thank you very much! I'll have a look at it. Are you referring to "Bayesian inference in statistical analysis"...? – Tamas Ferenci Apr 07 '18 at 12:01
  • Yes. You might see first if it's in Kruschke's or McElreath's books. – Frank Harrell Apr 07 '18 at 12:03
  • @FrankHarrell Wow, thanks! (Statistical rethinking and Doing Bayesian data analysis, am I right...?) – Tamas Ferenci Apr 07 '18 at 12:14
  • 1
    Yes. Big picture: frequentist framework is so complex for handling pre-testing that frequentists almost never correct the analysis for not being completely pre-specified. – Frank Harrell Apr 07 '18 at 23:05

0 Answers