
Before I begin, I would like to note that I am familiar with the ordinary debate between one-tailed and two-tailed tests and the gazillions of questions already on Cross Validated about the matter.

Essentially, though, it seems that there is no theoretical reason why a sort of "pick the best tail and test it" method could not exist. In fact, this is more or less what happens in practice; the result is just bootstrapped out of what is essentially one larger dataset.

For example, I am interested in choosing the better of treatment A vs treatment B. The ordinary scientific process here would be to gather a small exploratory dataset, take a look at the results, and decide informally that there is sufficient evidence that one or the other actually performed better to warrant further study. I then gather a larger dataset, and perform a one-tailed test in the direction suggested by my exploratory analysis.
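To make that concrete, here is a minimal sketch of the two-stage procedure in Python, assuming normally distributed outcomes; the effect sizes, standard deviation, group sizes, and the use of a $t$-test are all hypothetical placeholders for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical setup: treatment B is assumed to be slightly better than A.
mu_a, mu_b, sigma = 0.0, 0.3, 1.0
n_pilot, n_main = 20, 100   # exploratory and confirmatory sample sizes (made up)

# Stage 1: small exploratory sample, used only to pick a direction.
pilot_a = rng.normal(mu_a, sigma, n_pilot)
pilot_b = rng.normal(mu_b, sigma, n_pilot)
direction = "greater" if pilot_b.mean() > pilot_a.mean() else "less"

# Stage 2: larger confirmatory sample, one-tailed t-test in the chosen direction.
main_a = rng.normal(mu_a, sigma, n_main)
main_b = rng.normal(mu_b, sigma, n_main)
t, p = stats.ttest_ind(main_b, main_a, alternative=direction)
print(f"chosen direction: B {direction} than A, one-tailed p = {p:.4f}")
```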

However, putting the small study and the large study together, it seems that in reality what I have is a single process that simultaneously picks a direction and performs a test in that direction. This seems to contrast, in a theoretical sense, with the usual advice not to use your data to "pick a direction" to subsequently test; but that advice seems to stem strictly from the practical consideration that the usual formulae are not designed to use the same dataset both to choose a direction and to test in that direction.
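For what it's worth, the split version of this combined process does appear to behave like an ordinary level-$\alpha$ test, precisely because the direction is picked from data independent of the confirmatory sample: under no true difference, each tail is chosen about half the time and then tested at level $\alpha$ on fresh data. A rough check by simulation, under the same assumed normal setup as the sketch above (all numbers hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_pilot, n_main, n_sim = 0.05, 20, 100, 10_000

rejections = 0
for _ in range(n_sim):
    # Both groups drawn from the same distribution: the null is true.
    pilot_a, pilot_b = rng.normal(0, 1, n_pilot), rng.normal(0, 1, n_pilot)
    main_a, main_b = rng.normal(0, 1, n_main), rng.normal(0, 1, n_main)
    # Pick the tail from the pilot, test it on the independent main sample.
    direction = "greater" if pilot_b.mean() > pilot_a.mean() else "less"
    _, p = stats.ttest_ind(main_b, main_a, alternative=direction)
    rejections += p < alpha

print(f"empirical type I error of the combined procedure: {rejections / n_sim:.3f}")
```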

Taking this into consideration, though, it seems that splitting my data into an "exploratory" subset and a "confirmatory" subset may not be the best use of my limited data resources: stronger evidence for my one-tailed hypothesis may actually exist in the data than my process of splitting it into two separate experiments can produce in terms of a $p$-value.
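One way to probe that intuition directly is a small simulation comparing the two strategies: (1) spend part of the data on an exploratory pilot and run a one-tailed test on the remainder, versus (2) run a single two-sided test on all of the data and read the direction off the sign of the observed difference. A rough sketch, again with purely hypothetical effect size, sample sizes, and $\alpha$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n_pilot, n_main, n_sim = 0.05, 20, 100, 10_000
n_total = n_pilot + n_main
mu_a, mu_b, sigma = 0.0, 0.25, 1.0   # hypothetical true effect: B better than A

split_hits = pooled_hits = 0
for _ in range(n_sim):
    a = rng.normal(mu_a, sigma, n_total)
    b = rng.normal(mu_b, sigma, n_total)

    # Strategy 1: sacrifice a pilot to pick the tail, one-tailed test on the rest.
    direction = "greater" if b[:n_pilot].mean() > a[:n_pilot].mean() else "less"
    _, p_split = stats.ttest_ind(b[n_pilot:], a[n_pilot:], alternative=direction)
    # Success only if the correct direction was both chosen and confirmed.
    split_hits += (direction == "greater") and (p_split < alpha)

    # Strategy 2: all of the data in one two-sided test, direction from the sign.
    _, p_pooled = stats.ttest_ind(b, a)
    pooled_hits += (b.mean() > a.mean()) and (p_pooled < alpha)

print(f"power, pilot + one-tailed on the rest: {split_hits / n_sim:.3f}")
print(f"power, all data in one two-sided test: {pooled_hits / n_sim:.3f}")
```

Which strategy comes out ahead depends on how much data the pilot consumes relative to how reliably it picks the right tail, which is exactly the trade-off in question.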

Is my reasoning here flawed somehow? If not, does a superior method exist for performing the "pick the best tail and test it" procedure?

  • This seems like a good place to use a simulation study: is the loss of power due to decreasing the sample size worth the gain in power by doing a one-sided test? – Dave Oct 07 '21 at 18:46
  • @Dave IMHO the gain in power is secondary to the later interpretive power. "X does better than Y" is usually a vastly superior result in terms of real-world usefulness than "X is not equal to Y", which oftentimes is obvious before a study even begins. – Him Oct 07 '21 at 19:13
  • So if you observe $\bar x = 1$ and $\bar y = 3$ with a significant two-sided p-value ($<0.05$ or whatever), you wouldn't say that $\mu_y$ is greater than $\mu_x$, just that $\mu_x$ and $\mu_y$ are unequal? – Dave Oct 07 '21 at 19:28
  • @Dave I would say that $\mu_y > \mu_x$ is worth following up on. The quantitative result that you computed (the $p$ value) is probably meaningful in some way, but it doesn't mean what you probably wish that it would mean, which is that $\mu_y > \mu_x$ at some "level of confidence". This interpretation crucially hinges on the a priori hypothesis. – Him Oct 07 '21 at 20:56
  • You certainly can claim that $\mu_y > \mu_x$ in that case. – Dave Oct 07 '21 at 21:25
  • @Dave I mean, you can claim anything you like, I suppose. :) The point of running the test, though, is so that you can quantify the strength of your evidence. The "quantification" ($p$) in this case only applies to $\mu_y \neq \mu_x$. You get no metric for strength of evidence on $\mu_y > \mu_x$. So, then, if you're trying to convince someone that $\mu_y > \mu_x$ and they ask "how do you know?", and you say "well, the data provide evidence, but we're not sure how much evidence.".... that's not very convincing. – Him Oct 07 '21 at 22:56

0 Answers