Aggregation-Level in AB-Tests

Question

Hello fellow number crunchers

I hope this is a valid question for this forum. I am a lonesome quarter-statistician and have trouble finding someone to ask.

Introduction:

The AB-Test has become really popular since it is so easy to implement and execute. Additionally, the web is flooded with blogs explaining how to determine the significance of the results. All in all, it seems that there is less discussion about the control or exclusion of possible "influential" variables (on the other hand, controlling such variables is quite hard on the web).

Most AB-Tests compare the outcomes of both groups by simply counting how many clicks or conversions each group has generated. Then a binomial distribution is assumed for each group, and statistical tests are performed to see which group has the greater p.
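For concreteness, here is a minimal sketch of the kind of test described above: a pooled two-proportion z-test on raw conversion counts. The function name and the counts in the usage line are made up for illustration, and the normal approximation is an assumption that is only reasonable for large groups.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-sided z-test for H0: p_a == p_b (normal approximation)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 both groups share one conversion probability, estimated by pooling.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 120 of 1000 visitors converted in A, 100 of 1000 in B.
z, p_value = two_proportion_z_test(conv_a=120, n_a=1000, conv_b=100, n_b=1000)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```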

So the question is:

Is it "better" to compare the outcome of both groups without any aggregation, or is it "better" to aggregate, e.g. on a daily basis?

Example: The AB-Test is to check whether a landing page creates more newsletter subscriptions (<- conversions in this case). The AB-Test is deployed/online for the whole test time. In group A the landing page's main color is blue, in group B the main color is red. Assume 2000 visitors per day, i.e. each group gets roughly 1000 visitors. In this case "without aggregation" means that I get 1000 datapoints per group per day, whereas "aggregation on a daily basis" means that I get one (!) datapoint per group per day.

Discussion

The latter (aggregation on a daily basis) would allow the pairing of values, which in turn can capture daily effects like peaks in user behavior and preferences, but it extends the duration of AB-Tests, because it takes longer to collect enough datapoints. On the other hand, the former (no aggregation) seems to be the strategy of the majority, because ... I don't know, maybe because a) with enough traffic you can make nearly anything significant within one day b) it is the easiest thing to do.

One (possibly influencing) example to stimulate your thoughts: Assume that on one day during the test the "National Blue Day" is celebrated, so the color blue together with positive emotions is visible in all the media. This day group A has created a ton of conversions more than group B.

This difference clearly affects the test if aggregated on daily basis (increase of variance) or (if not aggregated at all) it either vanishes in the sea of data (in the case that multiple days are collected without aggregation) or it leads to the wrong results (if the day is the first and only day of the test).

Another example: Assume that the landing page belongs to a vegetable company. One day the "National Vegetable Day" is celebrated and now everyone wants to subscribe to the newsletter, no matter what the color is. This short-time effect is captured by aggregation on a daily basis and a paired test, but it increases the variance in the case of no aggregation (because no paired test can be performed here (is this even correct?)).
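To make the two options concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the base rates, the size of the "National Vegetable Day" spike, the true lift, and the seed. It computes the pooled (no aggregation) z statistic and the paired t statistic on daily rates from the same simulated visitors. Note that with equal group sizes every day, the two point estimates coincide exactly; only the standard errors differ.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

N_PER_GROUP = 1000                 # visitors per group per day (as in the example)
BASE_RATES = [0.10] * 13 + [0.25]  # hypothetical: the last day is "National Vegetable Day"
LIFT = 0.01                        # hypothetical true effect of variant A

def conversions(n, p):
    """Number of conversions among n visitors, each converting with probability p."""
    return sum(random.random() < p for _ in range(n))

daily_a = [conversions(N_PER_GROUP, p + LIFT) for p in BASE_RATES]
daily_b = [conversions(N_PER_GROUP, p) for p in BASE_RATES]

# No aggregation: pool all visitors and compute a two-proportion z statistic.
n_total = N_PER_GROUP * len(BASE_RATES)
p_a = sum(daily_a) / n_total
p_b = sum(daily_b) / n_total
p_pool = (sum(daily_a) + sum(daily_b)) / (2 * n_total)
z_pooled = (p_a - p_b) / math.sqrt(p_pool * (1 - p_pool) * 2 / n_total)

# Daily aggregation: one rate per group per day, then a paired t statistic.
diffs = [(a - b) / N_PER_GROUP for a, b in zip(daily_a, daily_b)]
mean_diff = statistics.mean(diffs)
t_paired = mean_diff / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

print(f"pooled difference: {p_a - p_b:.4f} (z = {z_pooled:.2f})")
print(f"mean daily diff  : {mean_diff:.4f} (t = {t_paired:.2f}, df = {len(diffs) - 1})")
```

The paired statistic treats the days as the independent units, which is itself an assumption that may not hold if the same visitors return on several days.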

All in all: Am I on the right track, or am I missing something completely?

What is an AB-test? A link, a brief explanation etc may help. — , Nov 24 '10 at 16:27

Andy W · Answer 1 · 2011-06-23T17:24:00.630

If the treatment is randomly assigned, the aggregation won't matter in determining the effect of the treatment (or the average treatment effect). I use lowercase in the following examples to refer to disaggregated items and uppercase to refer to aggregated items. Let's state a priori a model of individual decision making, where $y$ is the outcome of interest, and $x$ indicates whether an observation received the treatment:

$y = \alpha + b_1(x) + b_2(z) + e$

When one aggregates, one is simply summing random variables. So one would observe;

$\sum y = \sum\alpha + \beta_1(\sum x) + \beta_2(\sum z) + \sum e$

So what is to say that $\beta_1$ (divided by its total number of elements, $n$) will equal $b_1$? Because by the nature of random assignment all of the individual components of $x$ are orthogonal (i.e. the variance of $(\sum x)$ is simply the sum of the individual variances), and all of the individual components are uncorrelated with any of the $z$'s or $e$'s in the above equation.

Perhaps using an example of summing two random variables will be more informative. So say we have a case where we aggregate two random variables from the first equation presented. So what we observe is;

$(y_i + y_j) = (\alpha_1 + \alpha_2) + \beta_1(x_i + x_j) + \beta_2(z_i + z_j) + (e_1 + e_2)$

This can subsequently be broken down into its individual components;

$(y_i + y_j) = \alpha_1 + \alpha_2 + b_1(x_i) + b_2(x_j) + b_3(z_i) + b_4(z_j) + e_1 + e_2$

By the nature of random assignment we expect $x_i$ and $x_j$ in the above statement to be independent of all the other parameters ($z_i$, $z_j$, $e_1$, etc.) and each other. Hence the effect of the aggregated data is equal to the effect of the data disaggregated (or $\beta_1$ equals the sum of $b_1$ and $b_2$ divided by two in this case).
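A quick way to check this argument numerically is a simulation sketch like the one below. The coefficients, sample size, block size and seed are all made-up assumptions, and a continuous outcome is used to match the linear model rather than a binary conversion. It fits a simple regression of $y$ on $x$ (omitting $z$) once on individual rows and once on block means.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

ALPHA, B1, B2 = 2.0, 0.5, 1.0  # hypothetical true parameters
N = 200_000                    # individual observations
BLOCK = 100                    # observations per aggregated unit (e.g. one "day")

xs, ys = [], []
for _ in range(N):
    x = 1.0 if random.random() < 0.5 else 0.0  # random assignment
    z = random.gauss(0.0, 1.0)                 # other influence, independent of x
    e = random.gauss(0.0, 1.0)
    xs.append(x)
    ys.append(ALPHA + B1 * x + B2 * z + e)

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def block_means(values):
    """Aggregate consecutive observations into block averages."""
    return [sum(values[i:i + BLOCK]) / BLOCK for i in range(0, len(values), BLOCK)]

slope_individual = ols_slope(xs, ys)
slope_aggregated = ols_slope(block_means(xs), block_means(ys))

print(f"individual-level slope: {slope_individual:.3f}")
print(f"aggregated slope      : {slope_aggregated:.3f}")
```

Both slopes should land near the assumed $b_1$; the aggregated one is typically noisier here because the share of treated observations varies little between blocks.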

This exercise is informative, though, for seeing where aggregation bias comes into play. Anytime the components of that aggregated variable are not independent of the other components, you are creating an inherent confound in the analysis (e.g. you cannot independently identify the effects of each individual item). So going with your "blue day" scenario, one might have a model of individual behavior:

$y = \alpha + b_1(x) + \beta_2(Z) + b_3(x*Z) + e$

Where $Z$ refers to whether the observation was taken on blue day and $x*Z$ is the interaction of the treatment effect with it being blue day. It should be fairly obvious why this would be problematic if you take all of your observations on one day. If treatment is randomly assigned, $b_1(x)$ and $\beta_2(Z)$ should be independent, but $b_1(x)$ and $b_3(x*Z)$ are not. Hence you will not be able to uniquely identify $b_1$, and the research design is inherently confounded.
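The identification problem can be seen in a small simulation sketch (all parameter values, sample sizes and the seed are hypothetical). If every observation comes from blue day, then $Z = 1$ throughout, $x*Z$ equals $x$, and the treated-minus-control difference estimates $b_1 + b_3$. With a mix of blue and normal days, the four cell means separate the two.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

ALPHA, B1, B2, B3 = 0.0, 1.0, 2.0, 3.0  # hypothetical true parameters

def simulate(n, blue_day_share):
    """Draw n observations (x, Z, y) from y = alpha + b1*x + b2*Z + b3*x*Z + e."""
    rows = []
    for _ in range(n):
        x = 1 if random.random() < 0.5 else 0             # random assignment
        z = 1 if random.random() < blue_day_share else 0  # blue day indicator
        y = ALPHA + B1 * x + B2 * z + B3 * x * z + random.gauss(0.0, 1.0)
        rows.append((x, z, y))
    return rows

def cell_mean(rows, x, z):
    """Average outcome among observations with the given treatment and day type."""
    return statistics.mean(y for xi, zi, y in rows if xi == x and zi == z)

# Case 1: all observations taken on blue day, so x and x*Z are the same column.
only_blue = simulate(20_000, blue_day_share=1.0)
confounded = cell_mean(only_blue, 1, 1) - cell_mean(only_blue, 0, 1)  # estimates b1 + b3

# Case 2: observations on both kinds of day, so b1 and b3 can be separated.
mixed = simulate(20_000, blue_day_share=0.5)
b1_hat = cell_mean(mixed, 1, 0) - cell_mean(mixed, 0, 0)
b3_hat = (cell_mean(mixed, 1, 1) - cell_mean(mixed, 0, 1)) - b1_hat

print(f"blue day only: treated - control = {confounded:.2f} (b1 + b3 = {B1 + B3})")
print(f"mixed days   : b1_hat = {b1_hat:.2f}, b3_hat = {b3_hat:.2f}")
```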

You could potentially make a case for doing the data analysis on the aggregated items (aggregated values tend to be easier to work with and find correlations, less noisy and tend to have easier distributions to model). But if the real question is to identify $b_1(x)$, then the research design should be structured to appropriately identify it. While I made an argument above for why it does not matter in a randomized experiment, in many settings the argument that all of the individual components are independent is violated. If you expect specific effects on specific days, aggregation of the observations will not help you identify the treatment effect (it is actually a good argument to prolong the observations to make sure no inherent confounds are present).

Thank you for your formula based foundation of my gut feeling ;). Two questions: 1. What do you mean by "(it is actually a good argument to prolong .." ? Do you mean this: By prolonging the test one gets more observations for a) special days and hence can identify corresponding effects or b) normal days and hence special days become less important ? 2. Where can I learn more about this science ? Is this the area called "Design of Experiments" ? — mlwida, Apr 20 '11 at 11:46
@steffen , for the first question it is not that special days become less important, it is that you have observations independent of those special day effects. Hence you observe the process in different settings so you know the observed treatment effect is not due to those special days. Similar to what svadali stated about generalizing the findings, if you only run the experiment on blue days you don't know if the results would be the same on normal days. — Andy W, Apr 20 '11 at 12:38
For the second question it's difficult to recommend readings not knowing more about your work. My heart is close to this book, http://books.google.com/books?id=o7jaAAAAMAAJ&q , but perhaps a book more focused on typical experimental designs is more appropriate (the above link is for a widely known book that mainly focuses on non-experimental research designs). The AB test as far as I can tell is no different than a factorial experiment with only one factor. — Andy W, Apr 20 '11 at 12:45

score 1 · Accepted Answer · answered Nov 24 '10 at 18:09

The right level of aggregation depends on the time period over which you wish to generalize.

For example, you want to deploy A during nights across several sites but are unsure about its effectiveness relative to the existing option B. Thus, you may deploy A over a small number of sites and see its effects relative to the alternative B. In such a scenario, you need to aggregate the effects of A across all the nights that it was deployed to assess the relative impact of A vs B.

To use your example from the last para: If the interest lies in evaluating the impact of A across all days (possibly because A will be deployed on all days) then the 'right' thing to do is to aggregate across all days so that the test of A's effectiveness is not biased.

Thank you very much for your response. However, I am not sure if I was able to explain my problem. I've edited the question and added a concrete example. If it is not too much hassle, could you check and (if necessary) revise your response ? – mlwida Nov 25 '10 at 07:46
@steffen If you wish to deploy A if it is effective on a per day basis then you do aggregate to the day level. If on the other hand you do not care if it is inferior on some days then you aggregate it across days. Of course, if you aggregate it across days then as you said you will have one data point which means that you need to run the test many times across weeks/months to assess 'statistical effectiveness'. You need to decide first the answer to the question: Under what conditions will you conclude that A is effective? and test if those conditions are met via experiments. – Nov 25 '10 at 22:26
@Srikant Thanks again for your response, especially for the question (I was too focused on statistical properties and hence lost the sight of the aim). First I was afraid that I am totally wrong, because it seems that hardly anyone is discussing this issue (at least on the web). Now I am afraid that most AB-Testers do not ask even this question. – mlwida Nov 26 '10 at 07:45

score 0 · Answer 3 · answered Feb 04 '22 at 12:00

Raw data without aggregation is the easiest way and can conveniently handle more complex situations.

with enough traffic you can make nearly anything significant within one day?

Compared with most academic studies, the sample size of AB-tests in tech companies is generally large. Still, most AB-tests are actually underpowered because of large coefficients of variation and very small effect sizes (Variance and significance in large-scale online services).

more complex situations

iid assumption

Aggregating to a daily or hourly basis and then running a paired t-test often implicitly violates the iid assumption, which results in a seriously underestimated standard error and p-value.

In a typical AB-test, randomization is done by user. This means that the same user will re-enter the experiment on different days and be aggregated into different data points, resulting in dependence between data points.
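A rough A/A simulation sketch of this point follows. All numbers are hypothetical: users are randomized once, every user returns every day, and each user has a persistent conversion propensity. There is no true effect, so a valid test at the 5% level should reject in about 5% of runs; the paired t-test on daily rates tends to reject far more often in this setup.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

USERS_PER_GROUP = 200
DAYS = 30
SIMULATIONS = 200
T_CRIT = 2.045  # two-sided 5% critical value of Student's t with 29 degrees of freedom

def daily_rates(propensities):
    """Conversion rate per day when the same users return every day."""
    return [
        sum(random.random() < p for p in propensities) / len(propensities)
        for _ in range(DAYS)
    ]

false_positives = 0
for _ in range(SIMULATIONS):
    # A/A test: both groups are drawn from the same population of users.
    group_a = [random.uniform(0.0, 0.4) for _ in range(USERS_PER_GROUP)]
    group_b = [random.uniform(0.0, 0.4) for _ in range(USERS_PER_GROUP)]
    diffs = [a - b for a, b in zip(daily_rates(group_a), daily_rates(group_b))]
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(DAYS))
    false_positives += abs(t) > T_CRIT

false_positive_rate = false_positives / SIMULATIONS
print(f"share of A/A runs declared significant: {false_positive_rate:.2f} (nominal 0.05)")
```

The inflation comes from the fixed user composition of each group: the chance difference in average propensity between the groups shows up in every daily difference, so the days are not independent draws.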

estimating heterogeneous treatment effects

One (possibly influencing) example to stimulate your thoughts: Assume that on one day during the test the "National Blue Day" is celebrated, so the color blue together with positive emotions is visible in all the media. This day group A has created a ton of conversions more than group B.

As @Andy W mentioned, raw data with a regression (y ~ treat + day_of_week + treat * day_of_week) will be helpful to identify heterogeneous treatment effects.

variance reduction

Assume that the landing page belongs to a vegetable company. One day the "National Vegetable Day" is celebrated and now everyone wants to subscribe to the newsletter, no matter what the color is

Aggregating to a daily basis and then running a paired t-test doesn't make sense for variance reduction and may result in large variance and a biased standard error. A linear regression (y ~ treat + day_of_week) or post-stratification might be appropriate (Does it make mathematically sense to aggregate data in order to reduce variance in statistical significance tests?).
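As an illustration of the post-stratification idea on raw data, here is a minimal sketch. The function name, the record layout and the tiny data set are all made up for this example. It computes the treated-minus-control difference in conversion rate within each stratum (here the day) and averages those differences weighted by stratum size.

```python
from collections import defaultdict

def post_stratified_effect(records):
    """records: iterable of (stratum, treated, converted) tuples with 0/1 flags.

    Returns the stratum-size-weighted average of the per-stratum differences
    in conversion rate (treated minus control). Every stratum is assumed to
    contain at least one treated and one control observation.
    """
    # counts[stratum][treated] = [number of visitors, number of conversions]
    counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for stratum, treated, converted in records:
        cell = counts[stratum][int(treated)]
        cell[0] += 1
        cell[1] += int(converted)

    total = sum(c[0][0] + c[1][0] for c in counts.values())
    effect = 0.0
    for c in counts.values():
        n_stratum = c[0][0] + c[1][0]
        diff = c[1][1] / c[1][0] - c[0][1] / c[0][0]
        effect += (n_stratum / total) * diff
    return effect

# Hypothetical toy data: (day, treated, converted)
records = (
    [("mon", 1, 1)] * 2 + [("mon", 1, 0)] * 2    # treated on Monday: 2 of 4 convert
    + [("mon", 0, 1)] * 1 + [("mon", 0, 0)] * 3  # control on Monday: 1 of 4 converts
    + [("tue", 1, 1)] * 2                        # treated on Tuesday: 2 of 2 convert
    + [("tue", 0, 1)] * 1 + [("tue", 0, 0)] * 1  # control on Tuesday: 1 of 2 converts
)
effect = post_stratified_effect(records)
print(f"post-stratified effect: {effect:.4f}")
```

A standard error for this estimate is not computed here; the sketch only shows the point estimate.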
