Paired vs independent t-test for A/B test with underlying trends

Question

There is a metric which has a natural cyclic pattern. We want to measure the effect on this metric through a A/B test.

Examples:

Metric is daily revenue of ice-cream trucks which is low during weekdays and high on weekends; A/B test to check which of 2 music causes higher revenue. Each truck is assigned randomly to treatment. Everyday we get 2 data points — average revenues per truck in treatment A and B. The null hypothesis is that both music have the same effect on revenue.

Metric is revenue from an e-commerce site which is low during weekdays and high on weekends; A/B test to check which of 2 site layouts causes higher revenue. Visitors to the website are assigned randomly to treatments. Everyday we get the 2 data points — average revenue per visitor in treatment A and B. The null hypothesis is that both layouts have the same effect on revenue.

A underlying cyclic pattern on the metric violates the normal assumption and results in high SD when the samples are assumed to be i.i.d. This in turn leads to extremely large sample size for measuring small lifts. A paired t-test alleviates this somewhat. But all paired t-test examples seem to be centered around "multiple measurement of the same subject" idea.

My understanding is that the independent sample t-test is wrong simply because the samples are not i.i.d. (mean shifts WRT time) — this leaves out most tests; even permutation test which does not assume a known distribution. Paired t-test seems like a plausible idea, but so far have not encountered a similar recommendation.

Is there a simple test that can be applied here?
Or do we need to go down a "trend removal" technique — then apply t-test?

Here's a synthetic example in python (run code):

import numpy as np
from scipy import stats

x_data = np.linspace(0,1,101)
num_period = 3
treatment1 = np.sin(num_period*2*np.pi*x_data) + 1  # cyclic data
treatment2 = treatment1 + np.random.normal(0.05,0.05,len(treatment1))   # T1 + N(0.05,0.05)

stats.ttest_ind(treatment1,treatment2)
# Ttest_indResult(statistic=-0.5252661250185608, pvalue=0.5999800249755889)
stats.ttest_rel(treatment1,treatment2)
# Ttest_relResult(statistic=-10.13042526535737, pvalue=5.12638080641741e-17)
```

Can you clarify the design? Are you measuring trucks over the entire week, or just on a few days? Are you interested in additive effects, or is your hypothesis that the trucks given the treatment only sell more on the weekend? — Demetri Pananos, Jul 19 '20 at 21:23
@DemetriPananos We are measuring the average revenue per truck per day for each treatment. Hypothesis is that one of the treatments cause a higher revenue in aggregate (not limited to weekdays or weekends). Does that help? — BiGYaN, Jul 20 '20 at 03:42

BruceET · Accepted Answer · 2020-07-19T18:58:52.293

Pairing of some kind seems crucial because you want to compare Truck A on Wednesdays with Truck B on Wednesdays. However, as you say, a cyclic sales pattern may tend to be non-normal (but see Note at end). In order to have pairing without concern over normality, you might use a paired Wilcoxon test. It seems especially appropriate because the weekly distributional pattern will be similar for the two trucks.

Fake data for just one week and paired Wilcoxon test, in R:

x1 = c(120, 75, 80, 70, 85, 82, 130)
x2 = c(130, 89, 91, 79, 93, 99, 142)  # consistently higher
wilcox.test(x1,x2, pair=T)

         Wilcoxon signed rank test

data:  x1 and x2
V = 0, p-value = 0.01563
alternative hypothesis: 
  true location shift is not equal to 0

The null hypothesis that the two trucks have similar sales is rejected with P-value 0.016 < 0.05, even though there is a weekly trend of higher sales on Sun and Sat.

A two-sample Wilcoxon test without pairing does not detect that the second truck has consistently higher sales. [There is a warning message about ties (not shown here), so the P-value may not be exactly correct.]

wilcox.test(x1,x2)$p.val
[1] 0.1792339

Note: In judging normality for a paired t test, it is the paired differences that should be tested for normality. They may not show as aggressive a weekly pattern as do sales by individual trucks.

Thanks for the reply @BruceET. I agree with the logic. Are there some reference materials on this specific use of paired test (either `Wilcoxon` or `t-test`) for these kind of AB test problems? For my use case, a citable reference would help. — BiGYaN, Jul 22 '20 at 17:41
Nothing here is really new, so most intermediate level statistics books have details. I have most recently taught from Ott & Longnecker, which has over 1100 pages and maybe not ideal organization, but that's one reference. — BruceET, Jul 22 '20 at 18:35

Demetri Pananos · Answer 2 · 2020-07-20T03:46:35.083

One approach might be to used a mixed model with an indicator for day + a random effect for truck ID. This way, you can account for any truck level variation and assess the effect of the treatment via an indicator. This sounds feasible especially if you have lots of data to make up to the degrees of freedom being used by the indicators.

Here is an example of how this might be performed. I have 10 trucks, each truck's sales are measured over the course of a week. We assume that each truck has some differences due to the driver (or something, maybe one truck is newer and is more attractive than older ones, who knows). The hypothesized intervention increases sales by 2 units. Here is a plot of the data where each line is for a specific truck with colors indicating treatment group.

A linear mixed effect model for this data may look like


model = lmer(sales ~ factor(ndays) + trt + (1|truck), data = design )

The test you case about the the test for the trt variable, assuming you hypothesize additive effects (sales increase by the same amount on each day, not just on weekends). Here is a plot of the model for each truck with the data plotted over the model fit with an opacity.

Finally, I'm sure there is a way to do this without mixed effect models. In my own opinion, regression is a natural way to think of these sorts of comparisons, but a cleverly computed t-test is likely capable of accomplishing the same thing. Think of this approach as the most straight forward (in so far as it directly considers the generative processes) but perhaps not the easiest or even best.

"regression is a natural way to think of these sorts of comparisons" +1, agreed, and this connects with "you want to compare Truck A on Wednesdays with Truck B on Wednesdays" in Bruce's answer — Adrian, Jul 20 '20 at 04:03
The indicator for day and the regression are both new ideas to me. I was thinking around predict the underlying trend and then subtracting it out. Thanks for this approach. — BiGYaN, Jul 21 '20 at 07:17
@BiGYaN If this answer has given you what you need, please consider accepting it. Else, I can edit it to give more precise advice. — Demetri Pananos, Jul 22 '20 at 13:47

Paired vs independent t-test for A/B test with underlying trends

2 Answers2

Linked