Hello fellow number crunchers
I hope this is a valid question for this forum. I am a lonesome quarter-statistician and have trouble finding someone to ask.
Introduction:
The AB-Test has become really popular since it is so easy to implement and execute. Additionally, the web is flooded with blogs explaining how to determine the significance of the results. All in all, it seems that there is less discussion about controlling or excluding possibly "influential" variables (on the other hand, controlling such variables is quite hard on the web).
Most AB-Tests compare the outcomes of both groups by simply counting how many clicks or conversions each group has generated. Then a binomial distribution is assumed for each group, and statistical tests are performed to see which group has the greater success probability p.
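To make the "simple counting" approach concrete, here is a minimal sketch of a pooled two-proportion z-test, which is one common way to compare two binomial groups (the blogs may well use a different test). The function name and the example counts are my own assumptions, not data from a real test.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test on raw conversion counts (two-sided)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis p_a == p_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 120 of 1000 visitors convert in A, 100 of 1000 in B
z, p_value = two_proportion_z_test(120, 1000, 100, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

Note that this treats every visitor as an independent draw with the same p, which is exactly the assumption the question below is about.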
So the question is:
Is it "better" to compare the outcomes of both groups without any aggregation, or is it "better" to aggregate, e.g. on a daily basis?
Example: The AB-Test is to check whether a landing page creates more newsletter subscriptions (<- conversions in this case). The AB-Test is deployed/online for the whole test period. In group A the landing page's main color is blue, in group B the main color is red. Assume 2000 visitors per day, i.e. each group gets roughly 1000 visitors. In this case "without aggregation" means that I get 1000 datapoints per group per day, while "aggregation on a daily basis" means that I get one (!) datapoint per group per day.
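To illustrate what I mean by the two data shapes, here is a small sketch. The log format (day, group, converted) and the helper name daily_rates are made up for illustration; a real log would of course hold about 1000 rows per group per day.

```python
from collections import defaultdict

# Hypothetical visitor-level log: (day, group, converted), converted is 0 or 1.
# "Without aggregation" means testing directly on rows like these.
visits = [
    (1, "A", 1), (1, "A", 0), (1, "B", 0), (1, "B", 0),
    (2, "A", 0), (2, "A", 1), (2, "B", 1), (2, "B", 0),
]

def daily_rates(visits):
    """Collapse visitor-level rows into one conversion rate per (day, group)."""
    counts = defaultdict(lambda: [0, 0])  # (day, group) -> [conversions, visitors]
    for day, group, converted in visits:
        counts[(day, group)][0] += converted
        counts[(day, group)][1] += 1
    return {key: conv / n for key, (conv, n) in counts.items()}

# "Aggregation on a daily basis": one datapoint per group per day
rates = daily_rates(visits)
print(rates)
```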
Discussion
The latter (aggregation on a daily basis) would allow the pairing of values, which in turn can capture daily effects like peaks in user behavior and preferences, but it extends the duration of AB-Tests, because it takes longer to collect enough datapoints. On the other hand, the former (no aggregation) seems to be the strategy of the majority, because ... I don't know, maybe because a) with enough traffic you can make nearly anything significant within one day, or b) it is the easiest thing to do.
One (possibly influential) example to stimulate your thoughts: Assume that on one day during the test the "National Blue Day" is celebrated, so the color blue together with positive emotions is visible in all the media. On that day, group A generates a ton more conversions than group B.
This difference clearly affects the test if the data is aggregated on a daily basis (increase of variance). If it is not aggregated at all, it either vanishes in the sea of data (if multiple days are collected without aggregation) or leads to wrong results (if that day is the first and only day of the test).
Another example: Assume that the landing page belongs to a vegetable company. One day the "National Vegetable Day" is celebrated and now everyone wants to subscribe to the newsletter, no matter what the color is. This short-term effect is captured by aggregation on a daily basis and a paired test, but it increases the variance in the case of no aggregation (because no paired test can be performed there (is this even correct?)).
All in all: Am I on the right track or am I missing something completely?
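To make the "Vegetable Day" intuition testable, here is a rough simulation sketch. It assumes a day effect that shifts both groups by the same amount, and it compares the t statistic of a paired test on the daily rates with that of an unpaired (Welch) test on the same rates. All numbers (base rates, size of the day effect, number of days) are invented, and only the test statistics are computed, not p-values.

```python
import math
import random
import statistics

def simulate_daily_rates(days, base_a, base_b, visitors, seed=0):
    """Simulate daily conversion rates with a day effect shared by both groups."""
    rng = random.Random(seed)
    rates_a, rates_b = [], []
    for _ in range(days):
        day_effect = rng.uniform(-0.03, 0.03)  # assumed: same shift for A and B
        p_a = base_a + day_effect
        p_b = base_b + day_effect
        rates_a.append(sum(rng.random() < p_a for _ in range(visitors)) / visitors)
        rates_b.append(sum(rng.random() < p_b for _ in range(visitors)) / visitors)
    return rates_a, rates_b

def paired_t(a, b):
    """t statistic of a paired test: works on the per-day differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def welch_t(a, b):
    """t statistic of an unpaired (Welch) test: ignores which day a value came from."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

rates_a, rates_b = simulate_daily_rates(days=14, base_a=0.11, base_b=0.10, visitors=1000)
print("paired t:", paired_t(rates_a, rates_b))
print("Welch t: ", welch_t(rates_a, rates_b))
```

Under these assumptions the shared day effect cancels in the per-day differences, so the paired statistic will typically be larger in magnitude than the unpaired one. A day effect that hits only one group (the "Blue Day" case) would not cancel this way.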