
I have two populations that were exposed to two different websites meant to drive donations: one with a progress bar that pushes visitors to give (B, segment 2) and one without (A, segment 1).

[Image: histograms of donation amounts for segments A and B]

And with log(y):

[Image: the same histograms with a logarithmic y-axis]

I have noticed that, on average, population B gives much more than A:

                 s1          s2
count   3352.000000 3053.000000
mean    86.137828   109.417294
std     239.235495  231.897494
min     2.000000    3.000000
25%     20.000000   25.000000
50%     30.000000   50.000000
75%     60.000000   100.000000
max     9000.000000 6200.000000

But the distribution of bootstrapped sample means looks normal:

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

np.random.seed(42)  # seed before sampling so the bootstrap is reproducible

# Bootstrap the mean donation 10,000 times
means = []
for _ in range(10000):
    means.append(df["Amount Eq Euro"].sample(8007, replace=True).mean())

plt.hist(means, density=True, bins=30)  # density=False would make counts
plt.ylabel('Probability')
plt.xlabel('Mean of sample donations');

[Image: histogram of the bootstrapped sample means, roughly bell-shaped]

So I want to know which method I should use to test this. Should I use a t-test or a z-test? A colleague chose the t-test and found the difference significant, but I chose the z-test and did not.

Z-test

Indeed, since we’re interested in the average donation, averaging over the underlying distribution means that our final estimate could be well approximated by a normal distribution, which could look like this:

[Image: sketch of two approximately normal sampling distributions]

from scipy.stats import norm

df_segment_2 = df[df.Campaign.str.contains('segment 2')]['Amount Eq Euro']
df_segment_1 = df[~df.Campaign.str.contains('segment 2')]['Amount Eq Euro']

num_a, num_b = df_segment_1.count(), df_segment_2.count()
mean_a, mean_b = df_segment_1.mean(), df_segment_2.mean()
std_a, std_b = df_segment_1.std(), df_segment_2.std()

# The z-score is really all we need if we want a number
z_score = (mean_b - mean_a) / np.sqrt(std_a**2 + std_b**2)
print(f"z-score is {z_score:0.3f}, with p-value {norm().sf(z_score):0.3f}")

But when taking the difference between the two curves, I find a z-score of 0.070 with a p-value of 0.472. So it's not significant.

I know that the situation differs from a textbook z-test in that we do not have the population variance here; we use the sample standard deviation instead of the population standard deviation. But in my case, can't I just take the standard deviation from my data? Or do I need the population standard deviation, and should I therefore use a t-test?
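For comparison, here is a hedged sketch of a two-sample z-statistic computed purely from the summary numbers in the table above. Note that it scales the spread by the sample sizes (i.e. uses standard errors of the means), unlike the formula used earlier, which combines the raw standard deviations:

```python
import numpy as np
from scipy.stats import norm

# Summary statistics copied from the table in the question
n_a, n_b = 3352, 3053
mean_a, mean_b = 86.137828, 109.417294
std_a, std_b = 239.235495, 231.897494

# Standard error of the difference in means
se = np.sqrt(std_a**2 / n_a + std_b**2 / n_b)
z = (mean_b - mean_a) / se
p = 2 * norm.sf(abs(z))
print(f"z = {z:.3f}, two-sided p = {p:.2e}")
```

With these sample sizes this gives a z-statistic close to 4, in line with the t-test result reported further down.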

I even ran a simulation:

n = 10000
means_a = norm(mean_a, std_a).rvs(n)
means_b = norm(mean_b, std_b).rvs(n)
b_better = (means_b > means_a).mean()
print(f"B is better than A {b_better:0.1%} of the time")

And found that B is better than A only 53.0% of the time.

T-test

I also tried a t-test:

import scipy.stats as stats

stats.ttest_ind(a=df_segment_1, b=df_segment_2, equal_var=True)

Which returns:

Ttest_indResult(statistic=-3.946818060072667, pvalue=8.004451431980152e-05)

So it rejects the null hypothesis, as the p-value of 8.004e-05 is significant.

I don't understand why, and I don't understand what this test stands for. I thought it was only needed when a group has fewer than 30 observations.

Extra information for @Dave

Here are the statistics for the whole population

count    6405.000000
mean       97.234192
std       236.034497
min         2.000000
25%        20.000000
50%        40.000000
75%        80.000000
max      9000.000000
Name: Amount Eq Euro, dtype: float64
  • Do you mean a z-test that uses the population variance, or do you mean a z-test of proportions? – Dave Jun 15 '21 at 16:27
  • Hmm ... I don’t know @Dave? My z-test uses the population variance to test whether the average donation increased significantly. – Revolucion for Monica Jun 15 '21 at 16:40
  • What is the population variance, and how do you know that is the population variance? // Also, is there any notion of pairing in your data? You mention that B is better than A 53% of the time, but that statement only makes sense to me if the observations are paired. – Dave Jun 15 '21 at 16:41
  • The population variance (computed on the whole data) is 236.034497^2. I got it by taking the whole population on which I do both tests. – Revolucion for Monica Jun 15 '21 at 17:11
  • Do you really need a test? You have data from about 3000 potential donors in each group: you averaged 86.14 in group A and 109.42 in group B. (And if I read your graph correctly, most of the really large donations are in B.) For me, that would settle that B is best. // In Minitab (which accepts summary data) pooled and Welch 2-sample t tests both give t statistics about 4, which is very highly significant, P-value < 0.0005. // With such large sample sizes it probably doesn't matter, but neither population seems anywhere near normal. // Without seeing the formulas you used, I can't trouble-shoot the computations. – BruceET Jun 15 '21 at 17:12
  • If you have the population, then there is no such thing as inference (such as hypothesis testing). You have that your B-group outperformed your A-group. If, however, you have a sample and want to infer something about the population (such as future users), then you do not know the population values, so you do not know the population variance, and the z-test is inappropriate without some appeal to convergence of t-distributions to the standard normal. – Dave Jun 15 '21 at 17:17
  • I just added the formula used in Python @BruceET (and thanks for the Minitab reference, unfortunately I can't buy it as in our charity we are not rich enough :p) – Revolucion for Monica Jun 15 '21 at 17:21
  • @RevolucionforMonica "The population variance (computed on the whole data)" that is a *sample* variance. – Alexis Jun 15 '21 at 17:23
  • You are permitted to read the output I posted without buying Minitab – BruceET Jun 15 '21 at 17:23
  • Understood @Dave. Since I want to infer something about the population, especially future users, I don't know the population values. But I didn't understand your point about appealing to convergence of t-distributions to the standard normal. – Revolucion for Monica Jun 15 '21 at 17:23
  • Ok, I just got the results of the t-test and I'm starting to get it, @Dave . would really be interested if you could post an answer for a such beginner in statistics like me :) – Revolucion for Monica Jun 15 '21 at 17:35
  • @Galen yes! It is left-truncated because people can only give more than 0 euros :) – Revolucion for Monica Jun 15 '21 at 17:44
  • @RevolucionforMonica That does explain the left-truncation nicely. – DifferentialPleiometry Jun 15 '21 at 17:59
  • @RevolucionforMonica Making inferences about the future may or may not be possible in your case. You may find the subject of time series analysis valuable. For one thing, you might wish to learn if the donations are a stationary process. – DifferentialPleiometry Jun 15 '21 at 18:00
  • Thanks for your insight @Galen. Yet 1. these donations result from "campaigns", so they are very tied to these punctual events. Is it worth analysing this data through a time-series lens? 2. My main goal is to know whether the population that was shown the progress bar has donated / will donate more than the other. How much does time-series analysis help me answer that question? – Revolucion for Monica Jun 15 '21 at 23:25
  • My comment about timeseries pertains to making forecasts of future donations, but that may not be suitable for this case. – DifferentialPleiometry Jun 15 '21 at 23:35
  • For your main goal you can do the statistical inference about whether one group or the other donated more. See my answer below for that part. – DifferentialPleiometry Jun 15 '21 at 23:37
  • I am skeptical that you will be able to reliably infer what future populations of donors will do from comparing these groups. It may suffice for providing a recommendation, but I wouldn't promise certain results to anyone if I were you. – DifferentialPleiometry Jun 15 '21 at 23:39

3 Answers


You should not be using t-tests or z-tests because your histograms clearly show that your data is not even remotely normally-distributed. Slight violations of normality are alright, but your data appears to have left-truncation and right-skewness. The tailedness of your distribution might even prevent meaningful estimation of certain parameters. Watch Risk and Fat Tails for an introduction to how properties of tails become problematic for unbiased estimation.

A/B testing is not a specific hypothesis test. It is a term for a randomized experimental design with two groups, and thus is compatible with multiple statistical procedures and tests.

The good news is that you seem to have a decent sample size compared to the number of variables, which allows you to look into models that may require more degrees of freedom. A larger sample size also improves statistical power.

However, you seem to be looking for a simple test that one variable was (stochastically) larger than another. In which case, I recommend you read into the Mann-Whitney U test which has an implementation available in SciPy (since you're a Python coder).
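For illustration, a minimal sketch of how that might look with SciPy, using synthetic right-skewed data as a stand-in for the actual donation columns (the variable names and the lognormal shapes are assumptions, not your data):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two donation columns:
# strictly positive, right-skewed amounts, as in the question.
donations_a = rng.lognormal(mean=3.4, sigma=1.0, size=3352)
donations_b = rng.lognormal(mean=3.6, sigma=1.0, size=3053)

# Tests the null that the two distributions are equal against the
# alternative that one is stochastically larger than the other.
u_stat, p_value = mannwhitneyu(donations_a, donations_b,
                               alternative='two-sided')
print(f"U = {u_stat:.0f}, p-value = {p_value:.2e}")
```

Because the U statistic depends only on ranks, it is unaffected by the heavy right tail that makes the t- and z-tests questionable here.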

DifferentialPleiometry
  • Many thanks for your answer Galen. I am finishing watching the video by Pasquale Cirillo. I have added an updated graph of the whole distribution; is it better to give the big picture? Should I still assume right-skewness/fat tails? Furthermore, I understand that my data is not even remotely normally distributed. But what about the difference between the populations? Shouldn't that suffice for a simpler test (like Z or T)? Don't hesitate to explain it to me as if I were the dumb beginner (but quick learner) I really am :) – Revolucion for Monica Jun 16 '21 at 08:39
  • I didn't understand this sentence: *"you have a decent sample size compared to the number of variables, which allows you to look into models that may require more degrees of freedom".* I am new to _degrees of freedom_, but if I have understood [the definition](https://www.investopedia.com/terms/d/degrees-of-freedom.asp) correctly, here, if my population is of size $N$, I have $N$ degrees of freedom. I can't have fewer? So theoretically I can look at all models? Sorry, I'm really diving into abstract concepts I don't sufficiently master yet. – Revolucion for Monica Jun 16 '21 at 08:55
  • I've read in [the WMU implementation available in SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html) that it outputs a *p-value assuming an asymptotic normal distribution*. Since my distribution isn't really normal, as you mentioned, can I still use this test? – Revolucion for Monica Jun 16 '21 at 10:24
  • @RevolucionforMonica You raise a good point about distinguishing the random variables from their differences. If you have paired values, you could check if the paired differences are approximately normal. If they are, then a paired t-test will be suitable. There's not much point in using the z-test because the t-distribution approaches the normal distribution due to [CLT](https://en.wikipedia.org/wiki/Central_limit_theorem). – DifferentialPleiometry Jun 16 '21 at 15:22
  • @RevolucionforMonica Degrees of freedom is a difficult subject. You might wish to study through [this thread](https://stats.stackexchange.com/questions/16921/how-to-understand-degrees-of-freedom). – DifferentialPleiometry Jun 16 '21 at 15:25
  • 1
    @RevolucionforMonica I am happy to learn that you're reading the documentation for the tools you use. Anyway, the $p$-value is calculated from a null hypothesis distribution that is assumed to be the normal distribution in their implementation. This is because the test statistic will approach a normal distribution as the sample size increases. Because the [U-statistic](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#U_statistic) is restricted to summands that are 0, 1/2, or 1, the actual values of the original data don't matter; only the inequalities between groups. – DifferentialPleiometry Jun 16 '21 at 15:34
  • @RevolucionforMonica Yes, your distribution is right-tailed. – DifferentialPleiometry Jun 16 '21 at 15:43
  • Sorry for still being worried about whether or not to use this t-test, but doesn't the Central Limit Theorem allow me to use the t-test to find the population means? – Revolucion for Monica Jun 21 '21 at 12:26
  • If you had an infinitely sized sample, then definitely. If your data were roughly symmetric and unimodal, then probably. But your data looks possibly fat-tailed with some unknown shape parameter. While the CLT should eventually normalize your means, estimates from fat-tailed distributions will converge much more slowly (if the distribution has a finite mean). – DifferentialPleiometry Jun 21 '21 at 14:37
  • 1
    One option you have is to bootstrap the means from your original data, and analyze whether the distribution of those estimates are approximately normal. If they are, then a t-test should be fine. Otherwise, use something else. – DifferentialPleiometry Jun 21 '21 at 14:39
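The bootstrap check suggested in the last comment could be sketched like this. Synthetic right-skewed data stands in for the real donations, and D'Agostino-Pearson's `normaltest` is one of several possible normality checks, an assumption on my part, not something prescribed in the thread:

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(42)

# Synthetic right-skewed stand-in for the donation amounts
donations = rng.lognormal(mean=3.5, sigma=1.0, size=3352)

# Bootstrap the sample mean
boot_means = np.array([
    rng.choice(donations, size=donations.size, replace=True).mean()
    for _ in range(2000)
])

# Test the bootstrapped means for normality;
# a large p-value is consistent with approximate normality
stat, p_value = normaltest(boot_means)
print(f"normality-test p-value = {p_value:.3f}")
```

If the bootstrapped means clearly fail this check, the heavy tail is still distorting the sampling distribution of the mean at this sample size, and a rank-based test is the safer choice.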

You do not know the population variance. You calculated the sample variance and expect that value to be close to the population variance, but you do not know $\sigma^2$.

Therefore, you may have underestimated the variance. To offset this, we use a test statistic with a heavier tail than the normal distribution. The way the math works out for a test statistic of $\dfrac{\bar{x}-\mu_0}{s/\sqrt{n}}$, the test statistic is distributed as $t_{n-1}$.

As you get a larger and larger $n$, you expect to have a tighter estimate of $\sigma^2$, so if you underestimated the variance, you expect to underestimate by less and less. The tails of the test statistic, therefore, can get lighter. In the limit, as $n\rightarrow\infty$, you know the population variance, and there is a notion of the $t_n$ distribution converging to $N(0,1)$ as $n\rightarrow\infty$. For this reason, one might argue that, for a very large (intentionally vague terminology) $n$, the $z$-test might be a reasonable approximation of the $t$-test. I am not sure that I have seen anyone do this, however, and with any modern data science software having t-test tools built in, I see little reason to do this particular approximation.

This is why your colleague argues that $t$-testing is appropriate here, not $z$-testing.
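The convergence described above is easy to see numerically; a small sketch comparing the upper-tail probability of $t_{n-1}$ with that of the standard normal as $n$ grows (the statistic value 2.0 is arbitrary):

```python
from scipy.stats import t, norm

stat = 2.0  # an arbitrary test-statistic value

for n in (10, 30, 100, 3000):
    # Upper-tail p-value when the statistic is compared to t_{n-1}
    p_t = t.sf(stat, df=n - 1)
    print(f"n = {n:5d}: t-tail probability = {p_t:.4f}")

# In the limit, the t tail matches the standard normal tail
print(f"normal tail probability = {norm.sf(stat):.4f}")
```

The heavier tails of $t_{n-1}$ give larger p-values at small $n$, which is exactly the penalty for estimating $\sigma$ by $s$; by $n \approx 3000$, as in the question, the two are practically indistinguishable.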

Dave
  • There are valid arguments for why you might want to use a Wilcoxon Mann-Whitney U test (`wilcox.test` in R), but this post only concerns $t$ vs $z$. – Dave Jun 15 '21 at 17:45
  • That is a fair comment. WMW U test is not in the question. However, regardless of available alternatives, the OP should not be using either $t$ or $z$ with this data. – DifferentialPleiometry Jun 15 '21 at 17:48
  • Many thanks for your answer. Yet I didn't understand this sentence: *"The way the math works out for a test statistic of $\dfrac{\bar{x}-\mu_0}{s/\sqrt{n}}$, the test statistic is distributed as $t_{n-1}$"*. Does it mean that the test statistic of the normalized distribution is distributed as $t_{n-1}$? Don't hesitate to explain it to me as if I were a teenager/total beginner. I can understand quickly but I don't have tons of background :) – Revolucion for Monica Jun 16 '21 at 07:49
  • 1
    In the z-test, the test statistic is similar to that fraction but with $\sigma$ instead of $s$. That is compared to a standard normal distribution. When we use the estimated standard deviation $s$ instead of $\sigma$, we compare to $t_{n-1}$. – Dave Jun 16 '21 at 09:45

Continuing my comment, here is the Minitab output for the pooled two-sample t test. Maybe you can compare it with your computations.

Two-Sample T-Test and CI 

Sample     N  Mean  StDev  SE Mean
1       3352    86    239      4.1
2       3053   109    232      4.2

Difference = μ (1) - μ (2)
Estimate for difference:  -23.28
95% CI for difference:  (-34.84, -11.72)
T-Test of difference = 0 (vs ≠): 
 T-Value = -3.95  P-Value = 0.000  DF = 6403

Both [test & CI] use Pooled StDev = 235.7018
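The same pooled test can be reproduced in Python from the summary statistics alone, which may be easier to compare against your own computations than the Minitab output (`scipy.stats.ttest_ind_from_stats` takes means, standard deviations, and counts; the numbers below are from the question's summary table):

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the question (segment 1 vs segment 2)
stat, p_value = ttest_ind_from_stats(
    mean1=86.137828, std1=239.235495, nobs1=3352,
    mean2=109.417294, std2=231.897494, nobs2=3053,
    equal_var=True,  # pooled variance, matching the Minitab output
)
print(f"T-Value = {stat:.2f}  P-Value = {p_value:.5f}")
```

This reproduces the T-Value of about -3.95 shown above; passing `equal_var=False` would give the Welch version instead.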
BruceET