I have researched this quite a bit to try to answer it on my own, but found very contradictory answers. I would appreciate it if this community could help.

I set up a test to measure the impact of an offer sent via email to customers in the test group. Three months have passed; my test group has ~164K customers, while the control group has ~18K customers (90:10 split). Now I am trying to determine the impact of the offer on the test group: I want to determine whether the mean number of transactions in the test group is significantly higher than in the control group.

The problem is that most customers (~95%) do not make any transactions in the three-month period, which makes the distribution heavily right-skewed. Which test or methodology can I use to determine whether the mean of test > control? The two-sample parametric test assumes that the populations are normally distributed, which is definitely not true in my case, and I've read in a few places that Mann-Whitney shouldn't be used for comparing means. Please advise.

im_sdubey

1 Answer

In Web testing, given your sample size and provided your responses are not extremely skewed, the sample means can still be approximately normally distributed, and Welch's $t$-test (for unequal variances and sample sizes) can still be applied. You won't get an exact p-value, because the assumptions behind the $t$-distributed test statistic are violated, but the approximation should be good enough.
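As a rough illustration of the Welch's $t$-test described above, here is a minimal sketch in Python. The data are simulated, not real: the helper `simulate_transactions`, the 5.5% / 5.0% active rates, and the mean of 2 transactions per active customer are all assumptions made up for the example. Only the group sizes come from the question. It relies on `scipy.stats.ttest_ind` with `equal_var=False` (Welch) and a one-sided alternative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_transactions(n, p_active, mean_active, rng):
    """Hypothetical zero-inflated counts: most customers make 0 transactions."""
    active = rng.random(n) < p_active
    counts = np.zeros(n, dtype=int)
    # Active customers make at least 1 transaction, mean_active on average.
    counts[active] = 1 + rng.poisson(mean_active - 1, size=active.sum())
    return counts

# Group sizes from the question; rates are invented for illustration.
test = simulate_transactions(164_000, 0.055, 2.0, rng)
control = simulate_transactions(18_000, 0.050, 2.0, rng)

# Welch's t-test (unequal variances), one-sided: H1 is mean(test) > mean(control).
result = stats.ttest_ind(test, control, equal_var=False, alternative="greater")
welch_t, welch_p = result.statistic, result.pvalue

relative_lift = test.mean() / control.mean() - 1
print(f"mean test={test.mean():.4f}, mean control={control.mean():.4f}")
print(f"relative lift={relative_lift:.1%}, t={welch_t:.3f}, one-sided p={welch_p:.4f}")
```

With real data you would replace the two simulated arrays with the per-customer transaction counts (zeros included) for each group.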

There is no clear-cut answer for what constitutes extremely skewed responses. This question provides a case of extremely skewed data, which is unlikely to be your case given that your business metric is "mean number of transactions [per customer]". It might be the case if your metric were "mean spend [per customer]". The answer to this question (by myself) provides a rule of thumb from Kohavi et al. (2014) on how much skewness a $t$-test can deal with.
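To make the skewness check concrete, here is a small sketch. It assumes the rule of thumb is the commonly quoted form of roughly $355 \times s^2$ observations per variant, where $s$ is the skewness coefficient of the metric; please verify the constant against Kohavi et al. (2014) before relying on it. The function name `min_n_for_normal_mean` and the simulated control data are hypothetical.

```python
import numpy as np
from scipy import stats

def min_n_for_normal_mean(x, constant=355):
    """Rule-of-thumb minimum sample size per variant: constant * skewness^2.

    The default constant of 355 is an assumption (as commonly quoted from
    Kohavi et al., 2014); check it against the paper.
    """
    s = stats.skew(x)  # sample skewness coefficient
    return s, constant * s ** 2

rng = np.random.default_rng(2)
n_control = 18_000
# Hypothetical control group: ~95% zeros, Poisson(2) counts for the rest.
control = rng.poisson(2.0, size=n_control) * (rng.random(n_control) < 0.05)

skewness, min_n = min_n_for_normal_mean(control)
print(f"skewness={skewness:.2f}, rule-of-thumb minimum n={min_n:,.0f}")
print("enough observations in this group:", n_control >= min_n)
```

The smaller group (control) is the one to check, since it is the first to fall short of the threshold.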

If in doubt, the community's suggestion is to run some simulations to see how the sample mean behaves. One option under this approach is to simulate the distribution of the sample means by bootstrapping, and then compare the sample mean distributions (not the distributions of the two sets of responses) using Mann-Whitney. A note of caution: if you decide to use Mann-Whitney in this scenario, the result will depend on the number of bootstrap samples you draw, not the number of original samples.
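The bootstrap idea above can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, the group sizes are scaled down (20,000 and 2,000) so it runs quickly, and `bootstrap_means` and the choice of 500 resamples are hypothetical. The percentile interval on the difference in means is an extra that the answer does not mention; it does not depend on the number of bootstrap samples the way the Mann-Whitney p-value does.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def bootstrap_means(x, n_boot, rng):
    """Resample x with replacement n_boot times; return each resample's mean."""
    n = len(x)
    return np.array([x[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)])

# Hypothetical zero-inflated data (~95% zeros), scaled down for speed.
test = rng.poisson(2.2, size=20_000) * (rng.random(20_000) < 0.05)
control = rng.poisson(2.0, size=2_000) * (rng.random(2_000) < 0.05)

n_boot = 500
boot_test = bootstrap_means(test, n_boot, rng)
boot_control = bootstrap_means(control, n_boot, rng)

# 1. Do the bootstrapped means look roughly normal? Skewness near 0 suggests yes.
print("skew of bootstrapped means:", stats.skew(boot_test), stats.skew(boot_control))

# 2. Percentile interval for the difference in means (groups are independent).
diff = boot_test - boot_control
ci_low, ci_high = np.percentile(diff, [2.5, 97.5])
print(f"95% bootstrap interval for mean difference: [{ci_low:.4f}, {ci_high:.4f}]")

# 3. Mann-Whitney on the bootstrapped means. Caution: this p-value shrinks as
#    n_boot grows, regardless of how many original observations there are.
mw = stats.mannwhitneyu(boot_test, boot_control, alternative="greater")
print(f"Mann-Whitney U={mw.statistic:.0f}, one-sided p={mw.pvalue:.4g}")
```

Rerunning step 3 with a larger `n_boot` is a quick way to see the caveat about the number of bootstrap samples in action.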

B.Liu
  • Thank you B. Liu! This was very helpful, much appreciated.. – im_sdubey Jan 17 '21 at 07:06
  • Hi @B. Liu, I was thinking a bit more about the last part, where you suggested running Mann-Whitney on the sample mean distributions. Can you elaborate on that a little more, please? – im_sdubey Jan 18 '21 at 01:24
  • 1. If after bootstrapping I see that the two sample mean distributions are approximately normal, that would mean I can use the results from the Welch test, correct? 2. If upon bootstrapping I see that the distributions are not normal but their shapes and standard deviations are similar, I can use Mann-Whitney U to check for the median. 3. If, however, the distributions of the mean for the two samples after bootstrapping are not similar, or the standard deviations are quite different, can I still use Mann-Whitney? – im_sdubey Jan 18 '21 at 01:25
  • @im_sdubey 1. Practically yes, if you mean running a Welch’s t-test on the original samples, and if we are sticking to experiments on the Web (i.e. with enough users). 2 & 3. Again, practically yes. Mann-Whitney is a versatile tool that can be applied to compare the original sample distributions, or the bootstrapped sample mean distributions. The conclusion one draws from the test is different though; see the notes of caution below. Welch’s t-test on the original samples also remains a practical option if the skewness of the original samples is within an acceptable range. (1/3) – B.Liu Jan 18 '21 at 13:09
  • @im_sdubey Notes of caution on Q2/3: By using Mann-Whitney, you are not checking for the median, but comparing the distributions directly. In other words, you are no longer comparing two summary statistics, but are effectively asking if a sample from one group is greater than that of another _in general_. As you pointed out, this has nothing to do with the mean, though it has nothing to do with the median either. (2/3) – B.Liu Jan 18 '21 at 13:09
  • @im_sdubey An obvious downside is the loss of ability to tell business stakeholders the usual “the mean # transactions in the test group is x% higher than that of control group” with enough statistical rigour, because, well, Mann-Whitney is not parametric to start with. However, if you know the two distributions to be compared under Mann-Whitney are made out of bootstrap mean samples, there is little practical difference IMO between saying “one (mean) distribution is greater than another” and “the mean of a group is higher than another” provided the statistical test is not too sensitive. (3/3) – B.Liu Jan 18 '21 at 13:10
  • Thank you B. Liu! Greatly appreciate your comments – im_sdubey Jan 31 '21 at 20:59