I face the following problem: for my startup, I came up with 3 different marketing messages. I will display each message to 1,000 different people (3,000 people in total) and then measure the click-through rate.

Now my question is: which test should I use for statistical significance? I want to know which message (if any) performs best.

Jan Kro

3 Answers

Given that your objective is to determine which, if any, of the three messages is best, I'd skip the $\chi^2$, which merely tests for differences. If two messages are equally good and substantially better than the third, the $\chi^2$ will (hopefully) return a significant result, but you won't have learned all that you want to.

An alternative is the parametric bootstrap. Using the example in the comments above, we have three samples of size $1000$ which we can model as drawn from Binomial distributions with $n=1000$ and unknown probabilities. We estimate the probabilities by the three observed frequencies and generate a large number ($B = 10,000$, for example) of draws from each of the three Binomial distributions. We then compare the three draws for $b=1, 2, \dots, B$, identifying which is the largest, and report the resulting frequencies:

# Observed data
n <- 1000
observed_counts <- c(75, 50, 50)

# Estimate probabilities
p <- observed_counts / n

# Generate 10,000 samples for each message
x1 <- rbinom(10000, n, p[1])
x2 <- rbinom(10000, n, p[2])
x3 <- rbinom(10000, n, p[3])

# Count the frequency with which each is best
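# (note: ties between draws are broken in favor of the later message)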
best <- ifelse(x1 > x2, ifelse(x1 > x3, 1, 3),
                        ifelse(x2 > x3, 2, 3))

with the result:

> table(best)
best
   1    2    3 
9778  106  116 

Message 1 was best 97.78% of the time, corresponding to a p-value of 0.0222, roughly the same as the p-value of the $\chi^2$ test given in comments above (0.028).

However, consider a situation with observed frequencies of 7.5%, 7.5%, and 5%. The bootstrap returns:

# Observed data
n <- 1000
observed_counts <- c(75, 75, 50)

...

> table(best)
best
   1    2    3 
4823 5157   20 

which makes it quite clear that, although message 3 is worse, messages 1 and 2 are not significantly different. The $\chi^2$ test, on the other hand, returns a p-value of 0.0439, not as helpful a result!
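
For reference, that figure matches the goodness-of-fit $\chi^2$ on the raw click counts, which is quick to check in R (`chisq.test` on a single vector tests against equal expected proportions):

> chisq.test(c(75, 75, 50))

        Chi-squared test for given probabilities

data:  c(75, 75, 50)
X-squared = 6.25, df = 2, p-value = 0.04394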

jbowman

When testing proportions across multiple groups/bins of data, I usually go with the $\chi^2$ goodness-of-fit test. The NIST Engineering Statistics Handbook and Minitab's online help have great information on this topic.

The simple version is that you have your observed counts in each category ($O_{A},O_{B},O_{C}$). Under the null hypothesis that all messages convert at the same rate, the expected count ($E$) for each message is the total number of clicks divided by the number of messages (the groups all have the same size here). This gives you ($E_{A},E_{B},E_{C}$).

The $\chi^2$ contribution for each category is calculated as: $$\chi^2=\frac{\left(O-E\right)^2}{E}$$ which results in ($\chi^2_A,\chi^2_B,\chi^2_C$).

The test statistic is the sum of these contributions, $\chi^2 = \chi^2_A + \chi^2_B + \chi^2_C$. Its critical threshold is $\chi^2_{\alpha,\,df}$ with $df = k-1$ degrees of freedom for $k$ categories, and a p-value can be calculated from the corresponding $\chi^2$ distribution.
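
In R, this entire calculation is available directly (a sketch using the 75/50/50 counts discussed elsewhere in this thread; `chisq.test` on a single vector of counts runs this goodness-of-fit test against equal expected proportions):

observed <- c(75, 50, 50)  # observed clicks per message

# expected count under equal rates: sum(observed) / 3 = 58.33 per message
chisq.test(observed)

which gives $\chi^2 = 7.14$ on 2 degrees of freedom and a p-value of about 0.028.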

Tavrock
  • Thank you for the quick reply. As far as I remember, the chi-square goodness-of-fit test only shows IF one of the messages converts differently from the others, correct? If I want to do pairwise comparisons between the individual messages, I should then add the Marascuilo procedure, correct? Sounds reasonable to me. However, when I played around with some fake numbers and more versions, I got an unintuitive result. Let's say V1: 75/1000 (7.5%), V2: 50/1000, V3: 50/1000, V4: 50/1000, V5: 50/1000; the p-value is then 0.0473. If I add V6: 50/1000, the p-value becomes 0.0707. – Jan Kro Mar 13 '17 at 20:19
  • Conducting a pairwise test is certainly the "correct" way to do it, but you can also look at the contribution to $\chi^2$ as a good indication of whether you only have one or two factors that are off from the others. – Tavrock Mar 14 '17 at 10:12
  • The example you provide actually makes a lot of sense. With V1: 75/1000 (7.5%), V2: 50/1000, V3: 50/1000, the $p$-value is 0.028. Your expected value is 58.333 and your critical $\chi^2$ is essentially 6. With six rows, your expected value drops to 54.167 and your critical $\chi^2$ is essentially 11. The increased critical value and the decreased gap between expected and observed for the majority of the data make the one moderate spike less significant. Just in case that makes you feel like the test is not sensitive enough, changing V1 to 77/1000 is enough to bring your $p$-value down to 0.049. – Tavrock Mar 14 '17 at 10:12
  • Thank you for the insights @Tavrock. I agree that it makes sense from the calculation of the test statistic. However, I still fail to understand it logically (or rather, I wonder if the chi-square is the correct test here). Why does 7.5% vs. 5% become less significant if I test more messages? The only explanation I have is that the statement "Option A is better than Option B AND Option C" should be harder to prove than "Option A is better than Option B". – Jan Kro Mar 14 '17 at 15:46
  • What about the following approach: Take the two messages which convert best and just use a chi-square test on these two? – Jan Kro Mar 14 '17 at 15:47
  • While it is testing "is option A different from B, C, D, E, *and* F" it is *also* testing "is option B different from A, C, D, E, *and* F", and so on for each of the options. The more options that you have that are the *same*, the less *difference* it will find among the majority of the options. – Tavrock Mar 14 '17 at 16:18
  • Another way to think of it is the "find the odd letter" tests. Finding the `C` in `CO` or `OC` is easier than `OOCOOO` or `OOOOOCO`. – Tavrock Mar 14 '17 at 16:23

Tavrock mentions the chi-squared test as a solution. I disagree that this would be an appropriate test, for a few reasons. First, I imagine you are not interested in knowing whether differences exist between messages; you're more interested in knowing which message leads to the largest click-through. This is an altogether different hypothesis from the one the chi-squared is designed to test. Additionally, even if the chi-squared were an appropriate test, it does not easily offer estimates of the click-through rates along with their uncertainty.

A better (and not much more complex) approach is to use a Bayesian decision-making framework, as I do here. This approach is easily extended to $n$ marketing messages.
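
As a minimal sketch of that idea in R (assuming independent Beta(1, 1) priors and the 75/50/50 counts from the other answers; a different prior or decision rule could be substituted):

n <- 1000
clicks <- c(75, 50, 50)

# With a Beta(1, 1) prior, each click-through rate has a
# Beta(1 + clicks, 1 + n - clicks) posterior; draw from each
draws <- 10000
post <- sapply(1:3, function(i) rbeta(draws, 1 + clicks[i], 1 + n - clicks[i]))

# Posterior probability that each message has the highest rate
table(apply(post, 1, which.max)) / draws

# Posterior means and 95% credible intervals for the rates
colMeans(post)
apply(post, 2, quantile, probs = c(0.025, 0.975))

Unlike a bare test, this gives both the probability that each message is best and credible intervals for each click-through rate.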

Demetri Pananos