I have been searching for information on how to calculate sample size for A/B testing. My research was fruitless. I found many online calculators, but not any blog/paper that explains how the calculation is performed or how the formula is derived.

Can anybody explain how to calculate the sample size for an A/B test, given the power, significance level, and effect size?

Ferdi
aghd
  • Welcome to CV! If you mean the equivalent of a $t$-test for comparing elements of websites, just look up one of the examples here or elsewhere on $t$-test sample size calculation. – Frans Rodenburg May 30 '19 at 08:25
  • Possible duplicate of [Is there a minimum sample size required for the t-test to be valid?](https://stats.stackexchange.com/questions/37993/is-there-a-minimum-sample-size-required-for-the-t-test-to-be-valid) – Frans Rodenburg May 30 '19 at 08:26

1 Answer

There is no closed-form equation for the sample size, because the power involves the noncentral t distribution, whose degrees of freedom and noncentrality parameter both depend on $n.$ This web page has relevant theory and formulas in Sect. 2.2 beginning on p143. I will try to show one example to illustrate the computations involved.

Suppose you are doing a one-sided, pooled 2-sample t test at significance level $\alpha = 0.05.$ Your estimate of the common standard deviation is $\sigma = 4.$

Then, the crucial quantities are the size $\delta$ of the effect, the number $n$ of observations in each sample, and the power $\pi$ of the test against the difference of $\delta.$ In principle, if you specify any two of $\delta, n,$ and $\pi,$ then the third can be obtained. To begin, suppose $\delta = 5, n = 10,$ and we seek $\pi.$

The critical value $c$ of the test is determined so that $c$ cuts probability $\alpha$ from the upper tail of Student's t distribution with degrees of freedom $\nu = 2n - 2.$ That is, the test rejects $H_0$ when the pooled statistic satisfies $T > c,$ which happens with probability $\alpha$ if $H_0$ is true.

In particular, for the specific values mentioned above, we can find $c = 1.734,$ using R statistical software as follows:

    qt(.95, 18)
    [1] 1.734064

In order to find the power $\pi,$ we need to use the noncentral t distribution with $\nu = 2n - 2$ degrees of freedom and noncentrality parameter $\lambda = \frac{\delta}{\sigma\sqrt{2/n}}.$ Assuming the alternative hypothesis to be true, $T$ follows this noncentral t distribution, and the power is the probability $\pi = P(T \ge c),$ which turns out to be $0.851$ for the values above. (See Wikipedia for some technical details of the noncentral t distribution.)

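    n = 10;  df = 2*n - 2;  cv = qt(.95, df);  cv    # per-group n, df, one-sided critical value
    [1] 1.734064
    sg = 4;  dlt = 5;  lam = dlt/(sg*sqrt(2/n));  lam    # noncentrality parameter
    [1] 2.795085
    pwr = 1 - pt(cv, df, lam);  pwr    # power: P(T > cv) under the alternative
    [1] 0.8514775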
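As a cross-check, the same power can be obtained from `power.t.test()` in R's built-in `stats` package. The sketch below also wraps the manual computation in a small helper; the name `power_one_sided` is my own choice for illustration, not something from the reference above.

```r
# Built-in power computation for a one-sided, pooled two-sample t test.
res <- power.t.test(n = 10, delta = 5, sd = 4, sig.level = 0.05,
                    type = "two.sample", alternative = "one.sided")
res$power

# Manual computation as a reusable helper (hypothetical name).
power_one_sided <- function(n, delta, sigma, alpha = 0.05) {
  df  <- 2 * n - 2                       # pooled degrees of freedom
  cv  <- qt(1 - alpha, df)               # upper-tail critical value under H0
  lam <- delta / (sigma * sqrt(2 / n))   # noncentrality parameter
  1 - pt(cv, df, ncp = lam)              # P(T > cv) under the alternative
}
power_one_sided(n = 10, delta = 5, sigma = 4)
```

Both calls should agree with the value 0.8514775 computed above.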

Many statistical software programs have procedures for power and sample size. The following power curve for the values we used above is from Minitab. The value computed above using R is shown as a dot on the curve. Minitab's result matches our computation.

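[Figure: Minitab power curve for the one-sided pooled two-sample t test with $\sigma = 4$ and $n = 10$ per sample; the R-computed power at $\delta = 5$ is marked as a dot.]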
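A power curve of this kind can be approximated in R by evaluating the same computation over a grid of differences. This is only a sketch under the settings used above ($\sigma = 4,$ $n = 10,$ $\alpha = 0.05$); the grid from 0 to 8 is an arbitrary choice of mine, and the output is not meant to reproduce Minitab's plot exactly.

```r
sigma <- 4; n <- 10; alpha <- 0.05
df <- 2 * n - 2
cv <- qt(1 - alpha, df)                      # one-sided critical value

delta_grid <- seq(0, 8, by = 0.5)            # assumed range of differences
lam_grid   <- delta_grid / (sigma * sqrt(2 / n))
power_grid <- 1 - pt(cv, df, ncp = lam_grid) # power at each difference

print(data.frame(delta = delta_grid, power = round(power_grid, 4)))

# Draw the curve only in an interactive session.
if (interactive()) {
  plot(delta_grid, power_grid, type = "l",
       xlab = "Difference", ylab = "Power")
  points(5, power_grid[delta_grid == 5], pch = 19)  # the case computed above
}
```

At a difference of 0 the power equals the significance level, and it rises toward 1 as the difference grows.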

If you want to specify $\delta$ and $\pi,$ then many of these programs will search for the smallest $n$ that gives the requested power. For a fixed total number of observations and a common standard deviation, the most efficient design for a two-sample test is to have the sample sizes equal, and so most programs give one value of $n$ for each sample.
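Such a search is easy to sketch in R: increase $n$ until the power first reaches the target. The function names `power_one_sided` and `n_for_power`, and the cap `n_max`, are hypothetical choices for this illustration; real programs may use a root finder instead of a simple loop.

```r
# Power of the one-sided pooled two-sample t test with n per group.
power_one_sided <- function(n, delta, sigma, alpha = 0.05) {
  df  <- 2 * n - 2
  cv  <- qt(1 - alpha, df)
  lam <- delta / (sigma * sqrt(2 / n))
  1 - pt(cv, df, ncp = lam)
}

# Smallest per-group n whose power is at least `target` (linear search).
n_for_power <- function(delta, sigma, target, alpha = 0.05, n_max = 1e6) {
  n <- 2
  while (power_one_sided(n, delta, sigma, alpha) < target) {
    n <- n + 1
    if (n > n_max) stop("no n up to n_max reaches the target power")
  }
  n
}

n_needed <- n_for_power(delta = 5, sigma = 4, target = 0.85)
n_needed

# Built-in alternative: returns a fractional n, so round up.
n_builtin <- ceiling(power.t.test(delta = 5, sd = 4, power = 0.85,
                                  sig.level = 0.05, type = "two.sample",
                                  alternative = "one.sided")$n)
n_builtin
```

For $\delta = 5,$ $\sigma = 4$ and a target power of 0.85, the search stops at $n = 10$ per sample, consistent with the power of 0.8515 found above for $n = 10.$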

If you want to do a Welch 2-sample test, then you have to specify the two standard deviations, which are used in a slightly revised formula for $\lambda.$ (A formula on p144 of the link above shows how to handle a Welch test with $n_1 \ne n_2.$ There, $T_\nu(\cdot)$ represents the CDF of a t distribution and $T_\nu(\cdot \mid \lambda)$ the CDF of a noncentral t distribution.)
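I do not reproduce the formula from p144 here. The sketch below instead uses a common approximation that I am assuming for illustration: a noncentral t distribution with Satterthwaite degrees of freedom and $\lambda = \delta / \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}.$ It may differ in detail from the referenced formula, and the function name `power_welch_one_sided` is hypothetical.

```r
# Approximate power of a one-sided Welch two-sample t test.
# Assumption: T is approximately noncentral t with Satterthwaite df.
power_welch_one_sided <- function(n1, n2, sd1, sd2, delta, alpha = 0.05) {
  v1 <- sd1^2 / n1                     # variance of the first sample mean
  v2 <- sd2^2 / n2                     # variance of the second sample mean
  se <- sqrt(v1 + v2)                  # standard error of the difference
  nu <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))  # Satterthwaite df
  cv <- qt(1 - alpha, nu)              # one-sided critical value
  1 - pt(cv, nu, ncp = delta / se)     # P(T > cv) under the alternative
}

# Equal sizes and equal standard deviations: reduces to the pooled case.
power_welch_one_sided(n1 = 10, n2 = 10, sd1 = 4, sd2 = 4, delta = 5)

# Unequal sizes and standard deviations (illustrative values only).
power_welch_one_sided(n1 = 8, n2 = 14, sd1 = 3, sd2 = 5, delta = 5)
```

With $n_1 = n_2 = 10$ and $\sigma_1 = \sigma_2 = 4,$ the Satterthwaite degrees of freedom equal 18 and the result coincides with the pooled-test power of 0.8514775.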

Power computations for two-sided tests are similar, but there are two terms to compute (one for each tail); often one of the two terms is so small it can be ignored for practical purposes.
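For the same illustrative values as above ($\delta = 5,$ $\sigma = 4,$ $n = 10$), a two-sided version might be sketched as follows. The critical value now uses $\alpha/2$ in each tail, and the two terms are computed separately so their relative sizes can be seen; the variable names are my own.

```r
n <- 10; sigma <- 4; delta <- 5; alpha <- 0.05
df  <- 2 * n - 2
cv  <- qt(1 - alpha / 2, df)               # two-sided critical value
lam <- delta / (sigma * sqrt(2 / n))       # noncentrality parameter

upper <- 1 - pt(cv, df, ncp = lam)         # P(T >  cv) under the alternative
lower <- pt(-cv, df, ncp = lam)            # P(T < -cv) under the alternative
power_two_sided <- upper + lower

c(upper = upper, lower = lower, total = power_two_sided)

# Built-in check; strict = TRUE includes both tails.
builtin <- power.t.test(n = n, delta = delta, sd = sigma, sig.level = alpha,
                        type = "two.sample", alternative = "two.sided",
                        strict = TRUE)$power
builtin
```

Here the lower-tail term is negligible because the true difference is positive, so the two-sided power is essentially the upper-tail term alone. It is smaller than the one-sided power because the critical value is larger.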

BruceET