
Let's consider an e-commerce site. We have an A/B test for which we want to measure whether the average revenue from treatment A is statistically significantly different from that of treatment B, i.e. the main goal is to determine whether A is statistically significantly better than B (or vice versa).

Both the A and B variants generate a continuous stream of revenues, one value per purchased item, and for most items the revenue is zero. Our metric is the average revenue per displayed item, i.e. the raw data is composed of many items, each with its associated revenue if purchased, or zero otherwise.

The generated revenue has some patterns. For example, the sport category generates much more revenue than the other categories. Another example is a day-of-week pattern, in which we have much less revenue on weekends.

In order to reduce the variance I would like to account for these revenue patterns. One option is to aggregate the data: for example, calculate the average revenue per category, calculate the difference in revenue between A and B for each category, and then run a paired t-test on the per-category differences. Alternatively, aggregate the results per day and run the paired t-test on the per-day revenue differences.

The aggregation is done only to decrease variance; the research question is still which treatment generates statistically significantly more revenue globally (i.e. considering all days and all categories).
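
For concreteness, the per-day version I have in mind would look roughly like this (a minimal sketch; the data frame daily and its numbers are made up purely for illustration):

# hypothetical per-day aggregates: average revenue per displayed item
# for each variant on each of 14 days (placeholder numbers, not real data)
daily <- data.frame(
  day   = as.Date('2021-09-01') + 0:13,
  rev_a = runif(14, 0.8, 1.2),
  rev_b = runif(14, 0.8, 1.2)
)

# paired t-test on the per-day differences between the two variants
t.test(daily$rev_a, daily$rev_b, paired = TRUE)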

My questions:

Suppose that I aggregate per day:

  1. Does it make sense to aggregate and run a paired t-test? One downside of this approach is that we lose the information about the standard deviation within each day.

  2. If the different aggregation units have different sizes (e.g. each day has a different amount of revenue), should I weight them somehow (i.e. give higher-revenue days more weight), and if so, how can I apply weights in a paired t-test?

  3. Any other ideas for how to do this?

This question suggests that pairing is a good idea, but there are some crucial differences. For example, the data in that problem is already aggregated, and there is no solution for the different sizes of the aggregation units.

ofer-a
    1) What do you want to test? Better yet, what do you want to learn from your data? // 2) How would it help you to reduce the sample size? – Dave Oct 04 '21 at 14:17
  • Why do you want to reduce the variance? If you look at the aggregates, the groups would need to have exactly the same size, otherwise taking each aggregate as a single sample would lead to invalid results. – Tim Oct 04 '21 at 14:27
  • @Dave 1) I want to conclude which treatment generates more revenue in total. 2) There is high variance between the different groups: for example, when I compare Tuesday to Friday there is a huge difference, but if I compare A on Tuesday to B on Tuesday they will be similar. – ofer-a Oct 04 '21 at 14:33
  • @Tim The aggregations will be exactly the same size, as A & B will get the same number of items each day and for each category. The reason I want to reduce noise is that the natural variance between the days is huge. Quoting from the other post: "An underlying cyclic pattern in the metric violates the normality assumption and results in a high SD when the samples are assumed to be i.i.d. This in turn leads to an extremely large sample size for measuring small lifts." – ofer-a Oct 04 '21 at 14:38
  • It sounds like you want to control for other variables, more or less what ANCOVA does (which accounts for variability from other sources, like the day of the week). That said, I am not so sure that you will be happy with your results unless you consider the time-series nature of your data. – Dave Oct 04 '21 at 14:38
  • Rather than doing a t-test, a linear regression might be appropriate, where you *control* for the other variables, e.g. day of week and item category, and then do a test on the treatment coefficient: $revenue \sim treatment + category + \text{day_of_week}$. – seanv507 Oct 04 '21 at 15:14
  • @seanv507 Just to make sure I understand: my data is per item (millions of items for A & B). You are suggesting that I first aggregate the data, e.g. by day_of_week, and then run a linear regression to predict the revenue for each day and look at the p-value of each variable? But if I run the experiment for 14 days I would have only 14 aggregated rows for the linear regression, one per day; would that be enough? – ofer-a Oct 04 '21 at 19:21
  • I am saying that conceptually you don't have to aggregate at all and can just feed the item-level data into a linear regression model. Computationally, if your input data is discrete, you could group by day/category/treatment and pass the mean revenue and variance of each group into a weighted linear regression (a sketch follows these comments): https://en.wikipedia.org/wiki/Weighted_least_squares, and see https://lindeloev.github.io/tests-as-linear/ – seanv507 Oct 04 '21 at 20:23
  • @seanv507 I'll take a look, thanks. – ofer-a Oct 06 '21 at 07:51
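
A rough sketch of the regression suggested in the last two comments (the item-level data frame items and all of its columns are hypothetical, invented for illustration; the grouped weighted-least-squares version reproduces the same point estimate of the treatment coefficient, though its standard errors also need the within-group variances, as the linked Wikipedia article discusses):

library(dplyr)

set.seed(1)
# hypothetical item-level data: one row per displayed item, revenue mostly zero
items <- tibble(
  treatment   = rbinom(1e5, 1, 0.5),
  category    = sample(c('sport', 'fashion', 'home'), 1e5, replace = TRUE),
  day_of_week = sample(1:7, 1e5, replace = TRUE),
  revenue     = rexp(1e5) * rbinom(1e5, 1, 0.05)
)

# item-level regression: test the treatment coefficient while controlling
# for category and day of week
fit <- lm(revenue ~ treatment + factor(category) + factor(day_of_week), data = items)
summary(fit)$coefficients['treatment', ]

# grouped alternative: weighted least squares on the group means, weighted by
# group size; this gives the identical point estimate for the treatment coefficient
grp <- items %>%
  group_by(treatment, category, day_of_week) %>%
  summarise(mean_rev = mean(revenue), n = n(), .groups = 'drop')
fit_wls <- lm(mean_rev ~ treatment + factor(category) + factor(day_of_week),
              data = grp, weights = n)
summary(fit_wls)$coefficients['treatment', ]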

1 Answer


Aggregating and then doing a paired t-test doesn't make sense for variance reduction: it can actually result in larger variance and a biased standard error. A linear regression (ANCOVA) or post-stratification would be more appropriate.

Let's make some fake data.

library(tidyverse)
library(furrr)
options(future.globals.maxSize = 1073741824)

get_estimates <- function(strat_cnt = 1000) {
    
  df <- tibble(
    strat = c('s1', 's2'),
    strat_y = c(0.05, 0.5), # the two strata have very different conversion rates
    strat_cnt = rbinom(2, strat_cnt, c(0.7, 0.3)) # number of items per stratum (~70% / ~30% of strat_cnt)
  ) %>% 
    rowwise() %>% 
    mutate(y = list(rbinom(strat_cnt, 1, strat_y))) %>% 
    unnest(y) %>% 
    dplyr::select(strat, y) %>% 
    mutate(trt = if_else(runif(nrow(.)) > 0.5, 1, 0)) # random assign to A/B
    
  df_agg <- df %>% 
    group_by(strat) %>% 
    summarise(
      n = n(),
      n0 = sum(trt),           # number of items in the trt == 1 arm
      n1 = sum(1 - trt),       # number of items in the trt == 0 arm
      s = n0 * n1 / n,         # precision weight for the stratum difference (inverse-variance, up to a common factor)
      y0 = sum(y * trt),       # total revenue in the trt == 1 arm
      y1 = sum(y * (1 - trt)), # total revenue in the trt == 0 arm
      strat_diff = y1 / n1 - y0 / n0, # within-stratum difference in mean revenue (trt == 0 arm minus trt == 1 arm)
      .groups = 'drop'
    )
  
  df_agg %>% 
    summarise(
      
      # non_aggregate: pooled difference in means, ignoring the strata
      non_aggregate = sum(y1) / sum(n1) - sum(y0) / sum(n0),
      
      # agg_avg: aggregate by stratum, then take the unweighted average of the
      # per-stratum differences (what a paired t-test on the aggregates uses)
      agg_avg = mean(strat_diff),
      
      # post-stratification: aggregate by stratum, then weight by stratum size
      weight_diff = weighted.mean(strat_diff, n),
      
      # OLS regression / ANCOVA: with a categorical stratum, the OLS treatment
      # coefficient equals the per-stratum differences weighted by n0 * n1 / n
      # (up to sign, since strat_diff here is the trt == 0 arm minus the trt == 1 arm):
      # lm(y ~ trt + strat, df)$coefficients['trt']
      ols_diff = weighted.mean(strat_diff, s)
    )  
}
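
# Sanity check (not part of the original answer): on a single simulated data set,
# the OLS treatment coefficient from lm(y ~ trt + strat) matches the n0 * n1 / n
# weighted average of the per-stratum differences (computed here as the trt == 1
# arm minus the trt == 0 arm, i.e. the opposite sign convention to strat_diff above).
set.seed(42)
df_check <- tibble(
  strat = rep(c('s1', 's2'), times = c(700, 300)),
  strat_y = rep(c(0.05, 0.5), times = c(700, 300))
) %>% 
  mutate(
    y = rbinom(n(), 1, strat_y),
    trt = rbinom(n(), 1, 0.5)
  )

coef(lm(y ~ trt + strat, data = df_check))['trt']

df_check %>% 
  group_by(strat) %>% 
  summarise(
    diff = mean(y[trt == 1]) - mean(y[trt == 0]),
    w = sum(trt) * sum(1 - trt) / n(),
    .groups = 'drop'
  ) %>% 
  summarise(ols_equivalent = weighted.mean(diff, w))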

plan(multisession, workers = 4)
SIM_CNT <- 100000
sim_estimate <- furrr::future_imap_dfr(
  1:SIM_CNT,
  ~get_estimates(),
  .options = furrr_options(seed = TRUE),
  .progress = TRUE
)

sim_estimate %>% 
  pivot_longer(
    cols = c('non_aggregate', 'agg_avg', 'weight_diff', 'ols_diff'), 
    names_to = 'key', 
    values_to = 'value'
  ) %>% 
  ggplot() + 
    geom_freqpoly(aes(value, color = key), bins = 100) + 
    labs(
      x = 'diff', y = '', color = '',
      title = 'aggregating then doing a paired t-test results in large variance')

[Figure: frequency polygons of the simulated estimates from the four estimators (non_aggregate, agg_avg, weight_diff, ols_diff)]
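
To quantify the claim, one can also look at the Monte Carlo mean and standard deviation of each estimator across the simulation runs (a small addition using the sim_estimate object built above; the true difference is zero by construction, since y is generated independently of the treatment assignment):

sim_estimate %>% 
  summarise(across(
    c(non_aggregate, agg_avg, weight_diff, ols_diff),
    list(mean = mean, sd = sd)
  ))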

xiaoA