Hypothesis testing for two-samples

Question

I have some questions about choosing the appropriate statistical test. At my previous company, we used a T-test, Welch test or Mann-Whitney test to check for a statistical difference between two samples on a binary metric.

However, I have now started to research this further and found out about the Z-test, chi-square test, etc.

My current understanding is that I need to know what type of metric I am working with: binary or continuous. At my previous company we checked for a normal distribution, and if the data were normally distributed with equal variances, we used a T-test. When the sample variances were unequal, we used the Welch test. When the distribution was non-normal, we used the Mann-Whitney test.

I watched a video by a data scientist in which she doesn't mention checking for a normal distribution in the case of binary metrics. She recommends a Z-test or binomial test. For continuous metrics, she uses a Z-test or T-test. (picture below)

I know that there are companies which use the chi-square test for binary metrics and Mann-Whitney for continuous ones...

Could you please clarify the information I provided above? What should the proper process be?

The diagram uses the bogus 'rule of 30', which declares a t test is OK if $n \ge 30.$ One "justification" for this rule is that $\bar X$ is approximately normal if $n \ge 30.$ For some distributions (e.g., uniform) $n = 10$ is enough; for others (e.g., exponential) $n=100$ is not enough. Also, the t statistic is not necessarily normal if its numerator is nearly normal; the numerator and denominator need to be nearly independent (exactly true only for normal data). // The main right-side branch should be labeled simply "nearly normal." Then branching as to population variance known or not. — BruceET, Jun 13 '21 at 18:51
I have seen such statistics decision trees before, and they have told me the firm needs a statistician on staff. — Dave, Jun 13 '21 at 22:17
Deprecation of the incorrect and misleading 'rule of 30' is not new to this site. My Answer gives a few specific examples. For a more general discussion, see [this page](https://stats.stackexchange.com/questions/121852/how-to-choose-between-t-test-or-non-parametric-test-e-g-wilcoxon-in-small-sampl?rq=1). — BruceET, Jun 13 '21 at 22:24

BruceET · Answer 1 · 2021-06-15T16:16:47.013

Appropriate t test for normal data. Consider using a t test to distinguish between two normal samples of size $n = 35,$ one from $\mathsf{Norm}(\mu=100, \sigma=10)$ and the other from $\mathsf{Norm}(\mu=90, \sigma=10).$ At the 5% level, the probability of rejection is about $0.985.$ That is, the power against a difference of $\Delta = 10$ in means is about 98.5%.

Here are boxplots of two such samples:

set.seed(621)
x1 = rnorm(35, 100, 10);  x2 = rnorm(35, 90, 10)
boxplot(x1, x2, col="skyblue2", pch=20, horizontal=TRUE)

The following simulation looks at results from 100,000 two-sample t tests. (Welch tests are used, but it doesn't matter here: the two population standard deviations are equal and the sample sizes are equal, so pooled and Welch tests will have very nearly the same power.)

set.seed(2021)
n = 35
pv = replicate(10^5, t.test(rnorm(n, 100, 10),rnorm(n,90,10))$p.val)
mean(pv <= .05)
[1] 0.98497
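
As a cross-check on the simulated value, the power can also be approximated analytically. The following is a minimal sketch using `power.t.test` from base R's `stats` package; note that it assumes the pooled (equal-variance) t test on normal data rather than the Welch test used in the simulation, so small differences from the simulated figure are to be expected.

```r
# Analytic power of a two-sided, two-sample t test at the 5% level.
# Assumption: pooled-variance t test, normal data, equal SDs in both groups.
pw <- power.t.test(n = 35, delta = 10, sd = 10, sig.level = 0.05,
                   type = "two.sample", alternative = "two.sided")
pw$power
```

The result should land close to the simulated rejection rate above.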

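For non-normal data a t test is not appropriate. By contrast, consider samples of the same size, but from delayed (shifted) exponential distributions $90+\mathsf{Exp}(0.1)$ (with $\mu = 100,$ $\sigma = 10$) and $83+\mathsf{Exp}(0.1)$ (with $\mu = 93,$ $\sigma = 10$). Here an exponential distribution with rate $0.1$ has mean and standard deviation $1/0.1 = 10,$ and the shift adds to the mean only.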

Here are boxplots of two such samples:

set.seed(613)
x1 = 90 + rexp(35,.1);  x2 = 83 + rexp(35,.1)
boxplot(x1, x2, col="skyblue2", pch=20, horizontal=TRUE)

The inappropriate t test has only about 82% power. (A better test might take advantage of the different hard minimums $(90$ and $83)$ of the two distributions, if known.)

set.seed(2021)
n = 35
pv = replicate(10^5, t.test(90+rexp(n,.10),83+rexp(n,.10))$p.val)
mean(pv <= .05)
[1] 0.82108

Here, the distinction between the two populations is a shift in location, so a nonparametric, two-sample Wilcoxon rank sum test is appropriate. Also, it has much better power (here about 97%) than the inappropriate t test.

set.seed(2021)
n = 35
pv = replicate(10^5, wilcox.test(90+rexp(n,.10),83+rexp(n,.10))$p.val)
mean(pv <= .05)
[1] 0.96645

When the two samples have about the same non-normal shape, a Wilcoxon rank sum test is a better choice than a t test, regardless of whether sample sizes are above or below 30.
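
The sample-size point can be probed with a small simulation on either side of 30. This is only a rough sketch, reusing the shifted exponential populations from above; the helper name `power_both`, the sample sizes 20 and 50, and the reduced count of 2000 replications are illustrative choices, not part of the analysis above.

```r
set.seed(2021)

# Estimate the rejection rate at the 5% level for the Welch t test and the
# Wilcoxon rank sum test, with both samples drawn from shifted exponentials.
# power_both is a hypothetical helper written for this illustration.
power_both <- function(n, reps = 2000) {
  pv <- replicate(reps, {
    x <- 90 + rexp(n, 0.1)
    y <- 83 + rexp(n, 0.1)
    c(t = t.test(x, y)$p.value, wilcox = wilcox.test(x, y)$p.value)
  })
  rowMeans(pv <= 0.05)   # one estimated power per test
}

# Rows are the two tests, columns are the two sample sizes
res <- sapply(c(20, 50), power_both)
colnames(res) <- c("n=20", "n=50")
res
```

With only 2000 replications the estimates are noisier than those from the 100,000-replication runs, so they are best read as a rough comparison of the two tests at each sample size.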

For the particular samples in the boxplot above, the Wilcoxon RS test gives P-value $0.0007,$ as shown below:

set.seed(613)
x1 = 90 + rexp(35,.1);  x2 = 83 + rexp(35,.1)
wilcox.test(x1, x2)

        Wilcoxon rank sum test

data:  x1 and x2
W = 897, p-value = 0.0006731
alternative hypothesis: true location shift is not equal to 0
