Consider fictitious data in vectors x1 and x2, as summarized below:
summary(x1); length(x1); sd(x1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 2.023 3.000 7.000
[1] 4000 # sample size
[1] 1.014391 # sample standard deviation
summary(x2); length(x2); sd(x2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 2.087 3.000 7.000
[1] 4000
[1] 1.020878
In the boxplots below, group means are shown as red X's.
boxplot(x1, x2, horizontal=T, col="skyblue2", pch=20, names=c("x1","x2"))
points(c(mean(x1), mean(x2)), 1:2, pch="x", col="red")  # mark group means

We explore three tests, with comments on the appropriateness of each.
Two-sample Welch t test. The data are integers and
so cannot possibly be exactly normal. However, even
with sample sizes of 4000, Shapiro-Wilk normality tests
do not reject normality.
shapiro.test(x1)$p.val
[1] 0.2268543
shapiro.test(x2)$p.val
[1] 0.7670812
There is a slight difference in sample standard deviations, so we use the default Welch t test in R.
With P-value $0.005,$ the test shows a significant difference
between sample means $\bar X_1 = 2.02$ and $\bar X_2 = 2.09.$ With thousands of purchases, it is possible that
this difference is of practical importance.
t.test(x1, x2)
Welch Two Sample t-test
data: x1 and x2
t = -2.8235, df = 7997.7, p-value = 0.004761
alternative hypothesis:
true difference in means is not equal to 0
95 percent confidence interval:
-0.10885596 -0.01964404
sample estimates:
mean of x mean of y
2.02275 2.08700
Wilcoxon rank sum test. Box plots show samples of
essentially the same shape, so it seems reasonable to use the nonparametric two-sample Wilcoxon test of whether the locations of the two samples are the same. (The sample medians are the same, but the test is not precisely a test of whether medians differ significantly.)
For small samples we would worry about ties in the data, but for samples as large as 4000 the P-value of the test is obtained by using approximations that are not invalidated by ties.
The test finds a significant difference in locations with P-value $0.002.$
wilcox.test(x1, x2)
Wilcoxon rank sum test
with continuity correction
data: x1 and x2
W = 7699000, p-value = 0.002122
alternative hypothesis:
true location shift is not equal to 0
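For a numerical summary of the 'location shift' being tested, wilcox.test can also return a Hodges-Lehmann estimate of the shift along with a confidence interval via the conf.int argument. A minimal sketch (output not shown; with samples this large and heavily tied integer data, R uses a normal approximation):
wilcox.test(x1, x2, conf.int=TRUE)   # adds Hodges-Lehmann shift estimate and CI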
Permutation test. It is difficult to have a cogent
discussion about the 'best' test to use here without
knowing something about the data. For the t test, one can worry about
the non-normality of data; for the Wilcoxon RS test, one can worry about ties and the precise meaning of 'location'.
Although I believe both of these tests are OK for my fictitious data, there is no guarantee that either would work for your actual data (sight unseen).
Of course, there are conditions for use of a permutation test, but IMHO a permutation test is likely to work for a wider variety of data, so I show it here.
In your Question you seem to hint at
the use of re-sampling methods. Regardless of its exact
distribution, if we are willing to agree that the pooled 2-sample t test statistic is a reasonable way to measure the difference between sample means, we can get a good
idea of the actual distribution of the pooled t statistic
by using it as the 'metric' for a permutation test.
The idea is that we randomly scramble the 8000 observations into two groups of 4000. For each such random allocation we find the value of the metric (pooled t statistic). Then we can see how likely it is for the metric to be more extreme than the observed value t.obs
of the metric for the (un-permuted) observed data.
To facilitate doing the permutation test we put the data into 'stacked' format, as follows:
x = c(x1, x2); g = c(rep(1,4000), rep(2,4000))
t.test(x1, x2, var.eq=T)$stat
t
-2.823541
t.obs = t.test(x ~ g, var.eq=T)$stat; t.obs
t
-2.823541
We use sample(g) to randomly scramble the allocations of the 8000 observations into two groups of 4000 each. The P-value of the simulated permutation test is $0.0052 \pm 0.0014,$ so $H_0: \mu_1 = \mu_2$ is rejected in favor of $H_a: \mu_1 \ne \mu_2.$
set.seed(2021)
t.prm = replicate(10^4, t.test(x~sample(g), var.eq=T)$stat)
mean(abs(t.prm) >= abs(t.obs))         # simulated two-sided P-value
[1] 0.0052
2*sd(abs(t.prm) >= abs(t.obs))/100     # 2 SEs: margin of simulation error
[1] 0.001438538
hist(t.prm, prob=T, col="skyblue2",
     main="Simulated Permutation Dist'n")
curve(dnorm(x), add=T, lwd=2)                                 # standard normal density
abline(v=c(t.obs,-t.obs), col="orange", lwd=2, lty="dotted")  # observed +/- t.obs

The permutation distribution of the metric is nearly
distributed as $\mathsf{T}(\nu = 8000-2),$ which in
turn is almost exactly standard normal.
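For example, one can check in R that upper-tail critical values of the two distributions nearly coincide:
qt(0.975, df=8000-2)   # about 1.960
qnorm(0.975)           # about 1.960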
Notes: (1) The program for the permutation test runs slowly because of the large number of observations to be allocated at each iteration and because I used t.test
as a shortcut to find the pooled t statistic. A run with more iterations would give a more precise P-value within the indicated interval.
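If runtime is a concern, a faster alternative is to compute the pooled t statistic directly instead of calling t.test inside the loop. Here is a minimal sketch, where pool.t is a helper name introduced only for illustration, and x, g, and t.obs are assumed to be as defined above:
pool.t = function(x, g) {     # pool.t: illustrative helper, not from the code above
  a = x[g==1]; b = x[g==2]
  n1 = length(a); n2 = length(b)
  sp = sqrt(((n1-1)*var(a) + (n2-1)*var(b))/(n1+n2-2))   # pooled SD
  (mean(a) - mean(b))/(sp*sqrt(1/n1 + 1/n2))             # pooled t statistic
}
t.prm = replicate(10^4, pool.t(x, sample(g)))
mean(abs(t.prm) >= abs(t.obs))   # simulated two-sided P-value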
(2) In case it is of interest, I show R code for sampling the fictitious data in x1 and x2 used above. Data are integers. Roughly, the rationale is that a subject of either type has to have made at least one purchase in order to be in the database. Then, according to Poisson distributions (slightly higher mean for x2 than for x1), customers may make modest numbers of additional purchases.
set.seed(804)
x1 = 1 + rpois(4000, 1)     # at least one purchase, plus Poisson(1) additional purchases
x2 = 1 + rpois(4000, 1.1)   # slightly higher Poisson mean for the second group