What you are looking for is called "determination of minimum sample size" for a particular test, which is an application of statistical power analysis.
In your case one analyses an $r \times s$ contingency table. However, I can only provide details for the $2\times 2$ case, which can be extended to the multiple-test case using e.g. the so-called Bonferroni correction (details below). One test that can be performed here is e.g. a $\chi^2$-test or Fisher's exact test.
Let's say:
- conversion rate $=\frac{\text{sales}}{\text{clicks}}$
- $p_i$: conversion rate of Combination $i$
- $n_i$: sample size for Combination $i$ (i.e. the number of clicks)
Now what you want to calculate is: What are the minimum required sample sizes $n_i$ and $n_j$ such that my preferred statistical test with significance level $\alpha$ detects the difference $p_i-p_j$ with probability $1-\beta$, where ...
- $\alpha$ denotes the probability that one rejects the Null-Hypothesis although it is true (i.e. calls a difference significant which is not)
- $\beta$ denotes the probability that one does not reject the Null-Hypothesis although it is false (i.e. fails to identify a significant difference).
For the case of Fisher's exact test, one formula is due to Casagrande et al. and reads (according to my reference):
$n_i:=n_j:=\frac{A(1+\sqrt{1+4\delta/A})^2}{4\delta^2}$
where
$A=\left(u_{1-\alpha}\sqrt{2\frac{p_i+p_j}{2}(1-\frac{p_i+p_j}{2})}-u_{\beta}\sqrt{p_i(1-p_i)+p_j(1-p_j)}\right)^2$ and $\delta=p_i-p_j$
where
$u_{\alpha}$ is the $\alpha$-quantile of the standard normal distribution
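A minimal sketch of this calculation in Python (the function name and the use of the standard library's `NormalDist` for the quantiles $u_p$ are my own choices; I also take $\delta$ as the absolute difference so the argument order does not matter):

```python
from math import sqrt
from statistics import NormalDist

def min_sample_size(p_i, p_j, alpha, beta):
    """Approximate minimum sample size per combination for Fisher's
    exact test, following the Casagrande et al. formula above."""
    u = NormalDist().inv_cdf              # u_p: p-quantile of N(0, 1)
    delta = abs(p_i - p_j)                # assuming |p_i - p_j| is meant
    p_bar = (p_i + p_j) / 2
    A = (u(1 - alpha) * sqrt(2 * p_bar * (1 - p_bar))
         - u(beta) * sqrt(p_i * (1 - p_i) + p_j * (1 - p_j))) ** 2
    return A * (1 + sqrt(1 + 4 * delta / A)) ** 2 / (4 * delta ** 2)
```

Note that $u_\beta$ is negative for $\beta < 0.5$, so the minus sign in $A$ effectively adds the two square-root terms ($-u_\beta = u_{1-\beta}$). For the worked example below, `min_sample_size(10/500, 10/500*1.1, 0.05/3, 0.2/3)` gives roughly the first of the three sample sizes listed there.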
As you can see above, $n_i$ is set equal to $n_j$ here. But since you are performing an ABC-test, this should not be a problem, because the sample sizes for all combinations are roughly the same.
The Bonferroni correction:
Since you have three combinations, you have to perform at least 3 tests (1 against 2, 2 against 3, 1 against 3), so the $\alpha$ value you should use here is:
$\alpha_{corrected}=\alpha/3$, where $\alpha$ is your desired overall level (same for $\beta$).
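The division by 3 generalises: with $m$ combinations there are $\binom{m}{2}$ pairwise tests. A short sketch (the helper name is mine; dividing $\beta$ as well simply follows the convention used here):

```python
from math import comb

def bonferroni_levels(alpha, beta, m):
    """Bonferroni-corrected error levels for all pairwise tests
    among m combinations."""
    k = comb(m, 2)              # number of pairwise comparisons
    return alpha / k, beta / k  # dividing beta too follows the convention above

alpha_corr, beta_corr = bonferroni_levels(0.05, 0.2, 3)  # 3 tests for 3 variants
```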
Now let's perform an exemplary calculation:
Let's say you want to detect at least a difference reflecting a 10% increase, assuming that a smaller difference, although significant, would not be of interest (e.g. because of cost effectiveness).
So we got:
- $p_1=\frac{10}{500}$
- $p_2=\frac{10}{500}\cdot 1.1$ (which is roughly equivalent to the "true" $p_2=\frac{11}{498}$)
- $p_3=\frac{15}{503}>p_2\cdot 1.1$
- let's say $\alpha=0.05$ => $\alpha_{corrected}=0.05/3\approx 0.0167$
- let's say $\beta=0.2$ => $\beta_{corrected}=0.2/3\approx 0.0667$
Hence:
- $n_{p_1\,\text{vs}\,p_2}=136382.6$
- $n_{p_1\,\text{vs}\,p_3}=6425.552$
- $n_{p_2\,\text{vs}\,p_3}=10608.72$
You can see that the main influence on the outcome is the difference between the $p$s, which enters the formula squared (see $\delta$ above): greater differences can be demonstrated faster, i.e. with a smaller sample. So if you want to show, e.g. in an AB-test, that Combination 3 is better than Combination 1 (assuming that the measured conversion rates are the actual true ones), you can do this with a sample size of 4481.86 per combination (calculated without any $\alpha$- or $\beta$-correction). That takes a week if you generate $\frac{4481.86\cdot 2}{7}\approx 1281$ clicks per day.
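The closing arithmetic can be checked in a few lines (a sketch; the per-combination size 4481.86 is taken from the text as given):

```python
n_per_combination = 4481.86            # sample size per combination, from the text
total_clicks = 2 * n_per_combination   # an AB test has two combinations
clicks_per_day = total_clicks / 7      # spread over one week
print(round(clicks_per_day))           # about 1281 clicks per day
```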
Final note: This so-called "probability of being better" is presumably calculated using a Bayesian approach (I started a discussion about that here). I would not make a decision based on that number unless it is close enough to 0 or 1 (e.g. above 0.95 or below 0.05). One can also calculate the sample size Bayesian-style, but I am not done with that yet (I am also struggling with the interpretation of the GWO results).