It is well known that Welch's t-test is robust to violations of the normality assumption, and it is arguably underused by researchers.1 It is, of course, the default t-test in R. In terms of containing the false positive error rate, just how robust is Welch's test? I'm interested in really punishing the test and seeing how much abuse it can actually take. Running a few simulations, I found the results quite remarkable.

Statistical tests on sample sizes of n = 3 are routine in published biological research; check almost any issue of Science. I know this is anathema to statisticians, but it is nevertheless common. So let's take samples of n = 3 from any two distributions we like, set them to have the same mean, and simulate the Welch test p values.
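To make the setup concrete, here is a minimal sketch of the kind of simulation I mean. The exponential/normal pair, the 10,000 replicates, and the .05 cutoff are just illustrative choices; any two distributions with equal means will do.

    ## Simulated false positive rate for Welch's t-test with n = 3 per group.
    ## Both populations have mean 1, so every rejection at alpha = .05 is a false positive.
    set.seed(1)
    nsim <- 1e4                     # increase for a more stable estimate

    pvals <- replicate(nsim, {
      x <- rexp(3, rate = 1)        # right-skewed, mean 1
      y <- rnorm(3, mean = 1)       # symmetric, mean 1
      t.test(x, y)$p.value          # t.test() defaults to Welch (var.equal = FALSE)
    })

    mean(pvals < .05)               # simulated false positive error rate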
Similar simulations were performed here:
"false positive error rate from skewed distributions"
Changing the sample sizes to n = 3, the highest false positive error rate I can obtain is about .13.
Similar results are obtained from other skewed distributions, such as beta distributions, but Chi-squared distributions skewed in opposite directions are the most punishing pair I can find.
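For concreteness, one way to construct such an opposite-direction pair is to mirror a Chi-squared variable around its mean, so one group is right-skewed and the other left-skewed, with both means equal. The df = 1 choice below is mine and not necessarily the exact setup used in the linked simulations.

    ## Chi-squared populations skewed in opposite directions, both with mean 1.
    set.seed(2)
    nsim <- 1e4

    pvals <- replicate(nsim, {
      x <- rchisq(3, df = 1)        # right-skewed, mean 1
      y <- 2 - rchisq(3, df = 1)    # mirror image around 1: left-skewed, mean 1
      t.test(x, y)$p.value          # Welch by default
    })

    mean(pvals < .05)               # simulated false positive error rate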
Forget about power. Forget good statistical practice (for now).
What's the highest simulated false positive error rate that anyone can produce from samples of n = 3 from any distributions using Welch's test? Bonus marks for anyone who can provide a proof (not a simulation) of the theoretical upper limit.