
I work quite often with Google's Website Optimizer. Essentially, it allows you to make small changes to a site to determine whether they have an effect on the conversion rate, i.e. the ratio of sales to clicks on a web page. The stats look like this:

Combination 1: 500 clicks  10 sales
Combination 2: 498 clicks  11 sales
Combination 3: 503 clicks  15 sales

Now obviously that part makes sense to me, but then it gives a value called "Probability of being better", which is based (among other factors) on the sample size (see also this question). So a particular experiment with less data might be rated as having a 13% chance of being better, whereas with a larger dataset it might have a 90% chance of being correct.

Obviously a smaller sample size is more prone to swings in variance, but I'm curious whether there is a formula to determine when enough data has been collected so that a high probability of being better can be calculated (which seems to be equivalent to a small overlap of the corresponding confidence intervals, and hence a high probability of rejecting the null hypothesis).

brett
  • See if you can be a little more precise: "when enough data would be collected" for what? For x% confidence, perhaps? You're asking about the link between sample size and precision of estimates, which is a matter of statistical power. Confidence intervals (CI) for the number of sales (or for the % of clicks that result in sales) will, as you indicate, depend in large part on sample size. The basic formula for a CI for a proportion is given at http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval – rolando2 Apr 30 '11 at 17:02
  • I edited the question (we'll see if it passes peer review ;)). Please check that the edits do not contradict your intention. You misleadingly called GWO's "probability of being better" a "confidence", which confused some statisticians here. Please correct me if you were referring to something else with the term "confidence of probabilities". – mlwida May 03 '11 at 09:10

1 Answer


What you are looking for is called "determination of minimum sample size" for a particular test, which is an application of statistical power analysis.

In your particular case one analyses an $r \times s$ contingency table. However, I can only provide details for the 2x2 case, which can be extended to the multiple-test case using e.g. the so-called Bonferroni correction (details below). Tests which can be performed here are e.g. the $\chi^2$-test or Fisher's exact test.
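Such a test takes only a few lines to run on the question's numbers. As a quick illustration (my own sketch, assuming scipy is available; this is not something GWO exposes), here is Fisher's exact test for Combination 1 vs Combination 3:

```python
# Fisher's exact test on Combination 1 vs Combination 3 from the
# question. Each row of the 2x2 table is [sales, non-sales].
from scipy.stats import fisher_exact

table = [[10, 500 - 10],   # Combination 1: 10 sales out of 500 clicks
         [15, 503 - 15]]   # Combination 3: 15 sales out of 503 clicks

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.3f}, p-value = {p_value:.3f}")
# The p-value comes out well above the usual significance levels,
# which already hints that the observed samples are too small.
```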

Let's say:

  • $\text{conversion rate} = \frac{\text{sales}}{\text{clicks}}$
  • $p_i$ = conversion rate of Combination $i$
  • $n_i$ = sample size for Combination $i$ (i.e. the number of clicks)

Now what you want to calculate is: what is the minimum required sample size $n_i$ and $n_j$ such that my preferred statistical test with significance level $\alpha$ detects the difference $p_i-p_j$ with probability $1-\beta$, where ...

  • $\alpha$ denotes the probability of rejecting the Null-Hypothesis although it is true (i.e. declaring a difference significant when it is not)
  • $\beta$ denotes the probability of not rejecting the Null-Hypothesis although it is false (i.e. failing to detect an actual difference).

One formula for the case of Fisher's exact test is due to Casagrande et al. and reads (according to my reference):

$n_i:=n_j:=\frac{A(1+\sqrt{1+4\delta/A})^2}{4\delta^2}$

where

$A=\left(u_{1-\alpha}\sqrt{2\frac{p_i+p_j}{2}(1-\frac{p_i+p_j}{2})}-u_{\beta}\sqrt{p_i(1-p_i)+p_j(1-p_j)}\right)^2$ and $\delta=p_i-p_j$

where

$u_{\alpha}$ is the $\alpha$-quantile of the standard normal distribution
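For concreteness, here is a minimal sketch of this formula in Python (my own transcription, not code from any library; the quantiles $u_{\alpha}$ are taken from scipy's `norm.ppf`):

```python
# Minimum per-group sample size for Fisher's exact test, following
# the Casagrande et al. formula given above.
from math import sqrt
from scipy.stats import norm

def min_sample_size(p_i: float, p_j: float,
                    alpha: float, beta: float) -> float:
    delta = abs(p_i - p_j)                 # difference to detect
    p_bar = (p_i + p_j) / 2                # pooled proportion
    u_1_minus_alpha = norm.ppf(1 - alpha)  # u_{1-alpha}
    u_beta = norm.ppf(beta)                # u_{beta}, negative for beta < 0.5
    A = (u_1_minus_alpha * sqrt(2 * p_bar * (1 - p_bar))
         - u_beta * sqrt(p_i * (1 - p_i) + p_j * (1 - p_j))) ** 2
    return A * (1 + sqrt(1 + 4 * delta / A)) ** 2 / (4 * delta ** 2)
```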

As you can see above, $n_i$ is set equal to $n_j$ here. Since you are performing an ABC-Test, this should not be a problem, because the sample sizes of all combinations are roughly the same.

The Bonferroni correction: since you have three combinations, you have to perform at least 3 tests (1 vs 2, 2 vs 3, 1 vs 3), so the $\alpha$ value you should use here is $\alpha_{corrected}=$ your desired $\alpha$ divided by 3 (same for $\beta$).

Now let's perform an exemplary calculation. Let's say you want to detect at least a difference reflecting a 10% increase (assuming that a smaller difference, although significant, would not be of interest, e.g. for reasons of cost effectiveness).

So we got:

  • $p_1=\frac{10}{500}$
  • $p_2=\frac{10}{500}*1.1$ (which is roughly equivalent to the "true" $p_2=\frac{11}{498}$)
  • $p_3=\frac{15}{503}>p_2*1.1$
  • let's say $\alpha=0.05$ => $\alpha_{corrected}=0.05/3\approx 0.0167$
  • let's say $\beta=0.2$ => $\beta_{corrected}=0.2/3\approx 0.0667$

Hence:

  • $n_{p_1vsp_2}=136382.6$
  • $n_{p_1vsp_3}=6425.552$
  • $n_{p_2vsp_3}=10608.72$
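These figures can be computed with the sketch from earlier (hypothetical usage; the exact outputs depend on which values are plugged in for the $p$'s and on the precision of the quantiles, so they need not match the quoted numbers to the digit):

```python
# Hypothetical usage of min_sample_size from the sketch above,
# with the Bonferroni-corrected alpha and beta from the example.
alpha_c = 0.05 / 3
beta_c = 0.2 / 3
print(min_sample_size(10/500, 10/500 * 1.1, alpha_c, beta_c))  # p_1 vs p_2
print(min_sample_size(10/500, 15/503, alpha_c, beta_c))        # p_1 vs p_3
print(min_sample_size(11/498, 15/503, alpha_c, beta_c))        # p_2 vs p_3
```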

You can see that the main influence on the outcome is the difference between the $p$'s, which enters the formula squared (the $\delta^2$ in the denominator above): greater differences can be shown faster, i.e. with a smaller sample. So if you want to show e.g. in an AB-Test that Combination 3 is better than Combination 1 (assuming that the measured conversion rates are the actual true ones), you can do this with a sample size of 4481.86 per combination (calculated without any $\alpha$- or $\beta$-correction). That is about a week, if you generate $\frac{4481.86 \cdot 2}{7}\approx 1281$ clicks per day.

Final note: this so-called "probability of being better" is presumably calculated using a Bayesian approach (I started a discussion about that here). I would not make a decision based on that number unless it is close enough to 0 or 1 (e.g. above 0.95). One can also calculate the sample size Bayesian-style, but I am not done with that yet (I am also still struggling with the interpretation of GWO results).
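For illustration only, here is one plausible way such a number could be computed: a Monte Carlo estimate of $P(p_3 > p_1)$ from Beta posteriors with uniform priors. This is my assumption about the general approach, not GWO's documented method:

```python
# ASSUMED method, not GWO's documented one: estimate P(p_3 > p_1)
# by sampling from the Beta posteriors of the two conversion rates.
import numpy as np

rng = np.random.default_rng(0)
draws = 100_000
# Beta(1 + sales, 1 + non-sales): uniform Beta(1, 1) prior + data.
p1_post = rng.beta(1 + 10, 1 + 490, draws)  # Combination 1
p3_post = rng.beta(1 + 15, 1 + 488, draws)  # Combination 3
print((p3_post > p1_post).mean())  # Monte Carlo estimate of P(p_3 > p_1)
```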

mlwida