
The question is simple. As with most online A/B tests, we are more interested in delta% than in delta, where:

delta = (mean of treatment - mean of control)
delta% = (mean of treatment - mean of control)/mean of control. 

Say we use a two-sample t-test, where the statistic is based on delta, and we find that delta is significant. Can we then say that delta% is significant?

  1. If yes, why?
  2. If not, and we instead have to use delta% as the test statistic, how do we know its distribution (let's say the sample size is big)?
yabchexu
  • You need to distinguish between statistical significance and practical significance. For example, if you have 1+ million samples, you could show that the treated group spends a statistically significant 2 extra seconds viewing the web page, but if the average time per page is 5 minutes, is that extra time an improvement? (A practical improvement is a business decision, not a statistics question.) – Dave2e Oct 02 '20 at 13:23
  • @Dave2e Your point about the distinction between statistical and substantive significance is very relevant. But it is certainly possible to cook up an example where only one of the two passes at $\alpha = 5\%$. After all, $X-Y$ and $(X-Y)/Y$ are different random variables, with different distributions. – dimitriy Oct 03 '20 at 22:03
  • @Dave2e You are right, but you misunderstand my question. I was asking whether, in order to know that delta% is significant, we could use delta as the statistic; if not, we have to use delta% as the statistic in the hypothesis test. The question is not about practical significance. – yabchexu Oct 04 '20 at 01:28
  • @DimitriyV.Masterov That is what I am asking. For $(\mu_T-\mu_C)/\mu_C$ we usually have a way to estimate the variance, but it is difficult for me to determine its distribution. So usually I use $(\mu_T-\mu_C)$ as the random variable to decide whether $(\mu_T-\mu_C)/\mu_C$ is significant. I am not sure whether this is rigorous. – yabchexu Oct 05 '20 at 18:42

2 Answers

The absolute (delta) and relative (delta%) changes are different random variables, so you should try to calculate the ratio-based standard errors, CIs, and p-values if you care about the latter. This will not change your decision most of the time, but you will come across examples where it does matter (wider CIs, higher p-values, etc.). Ratios can be tricky like that.

Here's a toy example to make things clearer. Consider a two-sample test with a binary outcome where $N_T=N_C=1,359$. There are 163 successes in treatment and 136 in control. The p-value on the absolute difference is 0.098, so you would reject the null that the two groups are the same at $\alpha=.10$. However, the p-value on the relative difference is 0.101, so you would fail to reject. In some sense, this is an artifact of using a fixed threshold for significance and of the approximation inherent in the delta method, but it could lead to different decisions with the same data and decision rule under different definitions of the difference.

Now on to your second question. There are many ways to calculate the variance, with varying complexity. It depends on what tools you have access to, features of your data and experiments, and your company's level of statistical sophistication.

These methods are:

  1. Delta method (either with correlated means or uncorrelated means)
  2. Fieller's method (with correlated or uncorrelated means or regression version)
  3. Regression (either transforming the outcome, or transforming the coefficients or using a GLM and then using the delta method or Fieller's method)
  4. Bootstrapping (relative difference itself or regression), permutation tests
  5. Some combinations of the above, like bootstrapped GLM regression

If you are willing to assume that the two means are uncorrelated (which usually makes sense in an A/B test), there are simple formulas you can use (based on either the delta method or Fieller's method; one is given below). There are also canned commands, packages, and online calculators.

If you are not willing to assume that, you can use regression, since that returns the covariances pretty easily. Then you can either use the more complicated formulas that include the covariance term or have a stats package handle that for you. Another option is to log the outcome or use a GLM to get the effects in percent terms.

Personally, I find some version of regression easiest, and it still works even if there is no correlation, since in that case the covariance will be close to zero.

You can also bootstrap easily, either the relative change itself or the regression coefficients. There is no formula, since this is a resampling method. Make sure to set the seed so that you can replicate your work each time.

None of these approaches is exact; they are all approximations. In the example below, they align pretty closely.

For example, the delta method formula for the standard error of the relative change is

$$SE \left( \frac{B-A}{A} \right) \approx \sqrt{\frac{Var(B)\cdot A^2 - 2 \cdot Cov(A,B)\cdot A \cdot B + Var(A)\cdot B^2}{A^4}},$$

where $A$ is the mean in the control group and $B$ is the mean in the treatment group. Assuming uncorrelated means makes the covariance term zero, which simplifies the formula. Otherwise, regression is the easiest way to get the covariance between the two means.
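For instance, here is a sketch (mine, not part of the original session) of applying the delta method by hand after a regression, in the equivalent parameterization $d = B-A$, $c = A$, pulling the variances and the covariance out of the estimated VCE. The variable names match the blood-pressure example below; nlcom automates the same calculation:

. /* sketch: delta method by hand from the regression VCE */
. quietly regress bp i.sex
. matrix V = e(V)                      // rows/cols: 0b.sex, 1.sex, _cons
. scalar d   = _b[1.sex]               // d = B - A, the absolute difference
. scalar c   = _b[_cons]               // c = A, the control-group mean
. scalar vd  = el(V, 2, 2)             // Var(d)
. scalar vc  = el(V, 3, 3)             // Var(c)
. scalar cdc = el(V, 2, 3)             // Cov(d, c)
. /* delta-method variance of the ratio d/c = (B - A)/A */
. scalar v_ratio = vd/c^2 + (d^2*vc)/c^4 - (2*d*cdc)/c^3
. display "SE(ratio) = " sqrt(v_ratio) // should match the nlcom result below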

Below I compare blood pressure between men and women (analogous to a treatment and a control group) using Stata. I annotated the Stata code with brief explanations.

You can find some regression-based examples of Stata and R code in Lye, J., & Hirschberg, J. (2018). Ratios of Parameters: Some Econometric Examples. Australian Economic Review, 51(4), 578–602. doi:10.1111/1467-8462.12300.

In this dataset, women have 5% lower BP relative to men, which is about 8 mmHg in absolute terms. All of the relative difference CIs are roughly in the [-8%,-2%] range:

. sysuse bplong, clear
(fictional blood-pressure data)

. keep if when=="After":when
(120 observations deleted)

. isid patient

. /* summary stats */
. table sex, c(mean bp semean bp sd bp N bp)

----------------------------------------------------------
      Sex |   mean(bp)     sem(bp)      sd(bp)       N(bp)
----------+-----------------------------------------------
     Male |   155.5167    1.967891    15.24322          60
   Female |      147.2    1.515979    11.74272          60
----------------------------------------------------------

. label list sex
sex:
           0 Male
           1 Female

. set seed 10122020

. 
. /* (A) Absolute effect */
. ttest bp, by(sex) reverse

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  Female |      60       147.2    1.515979    11.74272    144.1665    150.2335
    Male |      60    155.5167    1.967891    15.24322    151.5789    159.4544
---------+--------------------------------------------------------------------
combined |     120    151.3583    1.294234    14.17762    148.7956     153.921
---------+--------------------------------------------------------------------
    diff |           -8.316667    2.484107               -13.23587   -3.397459
------------------------------------------------------------------------------
    diff = mean(Female) - mean(Male)                              t =  -3.3480
Ho: diff = 0                                     degrees of freedom =      118

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0005         Pr(|T| > |t|) = 0.0011          Pr(T > t) = 0.9995

. regress bp i.sex

      Source |       SS           df       MS      Number of obs   =       120
-------------+----------------------------------   F(1, 118)       =     11.21
       Model |  2075.00833         1  2075.00833   Prob > F        =    0.0011
    Residual |  21844.5833       118  185.123588   R-squared       =    0.0867
-------------+----------------------------------   Adj R-squared   =    0.0790
       Total |  23919.5917       119  201.004972   Root MSE        =    13.606

------------------------------------------------------------------------------
          bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
     Female  |  -8.316667   2.484107    -3.35   0.001    -13.23587   -3.397459
       _cons |   155.5167   1.756529    88.54   0.000     152.0383    158.9951
------------------------------------------------------------------------------

. 
. /* (B) Relative Effect */
. 
. /* (-1) logged outcome t-test (works for strictly positive data and small relative differences) */
. generate ln_bp = ln(bp)

. ttest ln_bp, by(sex) reverse

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  Female |      60    4.988759    .0100682    .0779881    4.968612    5.008905
    Male |      60    5.041963    .0127957    .0991153    5.016358    5.067567
---------+--------------------------------------------------------------------
combined |     120    5.015361    .0084655     .092735    4.998598    5.032123
---------+--------------------------------------------------------------------
    diff |            -.053204    .0162819               -.0854466   -.0209615
------------------------------------------------------------------------------
    diff = mean(Female) - mean(Male)                              t =  -3.2677
Ho: diff = 0                                     degrees of freedom =      118

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0007         Pr(|T| > |t|) = 0.0014          Pr(T > t) = 0.9993

. 
. /* (0) bootstrap means */
. capture program drop mybs

. program define mybs, rclass
  1.         quietly summarize bp if sex=="Female":sex
  2.         scalar female_avg_bp = r(mean) 
  3.         quietly summarize bp if sex=="Male":sex
  4.         scalar male_avg_bp = r(mean) 
  5.         return scalar ratio = (female_avg_bp - male_avg_bp)/male_avg_bp
  6. end

. 
. bootstrap ratio = r(ratio), reps(500) nodots nowarn: mybs

Bootstrap results                               Number of obs     =        120
                                                Replications      =        500

      command:  mybs
        ratio:  r(ratio)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |  -.0534777   .0153194    -3.49   0.000    -.0835031   -.0234522
------------------------------------------------------------------------------

. 
. 
. /* (1b) delta method using regression and ratio of predictions by hand */
. regress bp i.sex

      Source |       SS           df       MS      Number of obs   =       120
-------------+----------------------------------   F(1, 118)       =     11.21
       Model |  2075.00833         1  2075.00833   Prob > F        =    0.0011
    Residual |  21844.5833       118  185.123588   R-squared       =    0.0867
-------------+----------------------------------   Adj R-squared   =    0.0790
       Total |  23919.5917       119  201.004972   Root MSE        =    13.606

------------------------------------------------------------------------------
          bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
     Female  |  -8.316667   2.484107    -3.35   0.001    -13.23587   -3.397459
       _cons |   155.5167   1.756529    88.54   0.000     152.0383    158.9951
------------------------------------------------------------------------------

. nlcom ratio:(_b[1.sex])/_b[_cons]

       ratio:  (_b[1.sex])/_b[_cons]

------------------------------------------------------------------------------
          bp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |  -.0534777    .015552    -3.44   0.001     -.083959   -.0229963
------------------------------------------------------------------------------

. margins, eydx(sex) // another way: calculate the elasticity

Conditional marginal effects                    Number of obs     =        120
Model VCE    : OLS

Expression   : Linear prediction, predict()
ey/dx w.r.t. : 1.sex

------------------------------------------------------------------------------
             |            Delta-method
             |      ey/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
     Female  |  -.0549607   .0164307    -3.35   0.001    -.0874979   -.0224235
------------------------------------------------------------------------------
Note: ey/dx for factor levels is the discrete change from the base level.

. 
. /* (2a) logged outcome regression */
. /* works for strictly positive data and small relative differences */
. regress ln_bp i.sex

      Source |       SS           df       MS      Number of obs   =       120
-------------+----------------------------------   F(1, 118)       =     10.68
       Model |  .084920032         1  .084920032   Prob > F        =    0.0014
    Residual |  .938452791       118   .00795299   R-squared       =    0.0830
-------------+----------------------------------   Adj R-squared   =    0.0752
       Total |  1.02337282       119  .008599772   Root MSE        =    .08918

------------------------------------------------------------------------------
       ln_bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
     Female  |   -.053204   .0162819    -3.27   0.001    -.0854466   -.0209615
       _cons |   5.041963    .011513   437.94   0.000     5.019164    5.064762
------------------------------------------------------------------------------

. 
. /* (2b) GLM with exponentiated coefficients */
. glm bp i.sex, family(gaussian) link(log) nolog

Generalized linear models                         Number of obs   =        120
Optimization     : ML                             Residual df     =        118
                                                  Scale parameter =   185.1236
Deviance         =  21844.58333                   (1/df) Deviance =   185.1236
Pearson          =  21844.58333                   (1/df) Pearson  =   185.1236

Variance function: V(u) = 1                       [Gaussian]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   8.075427
Log likelihood   = -482.5256155                   BIC             =   21279.66

------------------------------------------------------------------------------
             |                 OIM
          bp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
     Female  |  -.0549607   .0164307    -3.35   0.001    -.0871643   -.0227571
       _cons |   5.046753   .0112948   446.82   0.000     5.024616     5.06889
------------------------------------------------------------------------------

. 
. /* (3) bootstrap ratio of predictions from regression by hand */
. bootstrap ratio = (_b[1.sex]/_b[_cons]), reps(500) nodots: regress bp i.sex

Linear regression                               Number of obs     =        120
                                                Replications      =        500

      command:  regress bp i.sex
        ratio:  _b[1.sex]/_b[_cons]

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |  -.0534777   .0147228    -3.63   0.000    -.0823338   -.0246215
------------------------------------------------------------------------------

. 
. /* (4) Fieller's method (uncorrelated means) */
. /* there is also a correlated means version */
. fieller bp, by(sex) reverse
Confidence Interval for a Quotient by Fieller's Method (Unpaired Data)

Numerator Mean:   147.2
Denominator Mean: 155.51667
Quotient:         .94652234
95% CI:      .91652092 to .97771318

. 
. /* (5) delta method by hand (uncorrelated means) */
. /* there is also a correlated means version */
. table sex, c(mean bp sd bp N bp)

----------------------------------------------
      Sex |   mean(bp)      sd(bp)       N(bp)
----------+-----------------------------------
     Male |   155.5167    15.24322          60
   Female |      147.2    11.74272          60
----------------------------------------------

. display "SE(ratio) = " sqrt(((15.24322^2/60)*(155.5167)^2+(11.74272^2/60)*(147.2)^2)/(155.5167^4))
SE(ratio) = .01566056

The Fieller method above calculates the quotient $\frac{\bar Y_{female}}{\bar Y_{male}}$ rather than the relative change, but the two are equivalent: the relative change is just the quotient minus one. The paper cited above has R and Stata code to calculate the relative change with regression.
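To make that equivalence concrete, here is a small sketch (mine, not part of the original session) that converts the Fieller quotient CI above into a relative-change CI:

. /* sketch: relative change = quotient - 1 */
. display "relative change: " .94652234 - 1                 // = -.05347766
. display "95% CI: [" .91652092 - 1 ", " .97771318 - 1 "]"  // = [-.08347908, -.02228682]

This lands in the same roughly [-8%, -2%] range as the other methods.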


Here is some code showing that the p-values can also differ depending on whether absolute or relative change is used with Wald and Wald-type tests:

. sysuse bplong, clear
(fictional blood-pressure data)

. keep if when=="After":when
(120 observations deleted)

. estimates clear

. qui regress bp i.sex

. /* Absolute effect Wald-type test */
. testnl _b[1.sex] = 0

  (1)  _b[1.sex] = 0

               chi2(1) =       11.21
           Prob > chi2 =        0.0008

. display r(p)
.00081412

. /* Relative effect Wald-type test */
. testnl _b[1.sex]/_b[_cons] = 0

  (1)  _b[1.sex]/_b[_cons] = 0

               chi2(1) =       11.82
           Prob > chi2 =        0.0006

. di r(p)
.00058466

. /* Absolute effect Wald test */
. test _b[1.sex] = 0

 ( 1)  1.sex = 0

       F(  1,   118) =   11.21
            Prob > F =    0.0011

. display r(p)
.00109302

. /* Relative effect Wald test */
. margins, eydx(sex) post

Conditional marginal effects                    Number of obs     =        120
Model VCE    : OLS

Expression   : Linear prediction, predict()
ey/dx w.r.t. : 1.sex

------------------------------------------------------------------------------
             |            Delta-method
             |      ey/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
     Female  |  -.0549607   .0164307    -3.35   0.001    -.0874979   -.0224235
------------------------------------------------------------------------------
Note: ey/dx for factor levels is the discrete change from the base level.

. test _b[1.sex] = 0

 ( 1)  1.sex = 0

       F(  1,   118) =   11.19
            Prob > F =    0.0011

. di r(p)
.00110368
dimitriy
  • Hi Dimitriy, thanks for this comprehensive explanation; it helps me a lot. I know how to estimate the variance of the ratio with the methods you list above, but having an estimate of the variance of the ratio is not enough: we need to know the distribution of the statistic in order to calculate the p-value with a parametric test. I am not familiar with the Wald test above. Taking the example you gave of a two-sample test with a binary outcome, which test are you using and how did you calculate the p-value? – yabchexu Oct 17 '20 at 06:33
  • I used regression to do that part, with the delta method for the relative-difference variance. You can also bootstrap to get the distribution of the statistic. Fieller's method could be used iteratively to get a p-value. I suppose permutation tests are another option to get the distribution under the null. – dimitriy Oct 17 '20 at 06:40
  • Thanks. The nonparametric methods are easy, since we don't assume any distribution there; for the parametric methods, I was confused about how to proceed. Thanks a lot. – yabchexu Oct 17 '20 at 08:30

Equivalent null hypotheses

Yes

When you ask a question about significance, you are in the realm of hypothesis testing.

For your situation, the hypotheses $H_0: \Delta = 0$ and $H_0: \Delta\% = 0$ are equivalent, provided you assume that the 'mean of control' is non-zero (and if you do not assume that, then $\Delta\%$ becomes problematic to define because of the potential division by zero).

So if $\Delta$ is significantly different from $0$ then you can also claim that $\Delta\%$ is significantly different from $0$.

Distribution of the data, not the parameter estimate

Also, note that the test is normally not performed by observing only $\Delta$ or only $\Delta\%$. If it were, you would have a problematic situation with unknown nuisance parameters.

Instead, you use some test statistic that is based on the observations of the treatment group and the observations of the control group, e.g. their means $\mu_{treatment}$ and $\mu_{control}$. The test procedure would be the same for the hypotheses $H_0: \Delta = 0$ and $H_0: \Delta\% = 0$, because you do not base yourself on the sample distributions of $\Delta$ or $\Delta\%$, but instead on the joint sample distribution of $\mu_{treatment}$ and $\mu_{control}$. You base a significance test on the data.

You might think that the significance test is different because the estimates of $\Delta$ and $\Delta\%$ have different sample distributions. However, the parameter estimate and its sample distribution are not necessarily the statistic that is used for the significance test.

For instance, when we perform linear regression, we might estimate a parameter and perform a t-test that relates to the sample distribution of that parameter. But we could also perform an analysis of variance and use an F-test, which does not care how you express the parameter.
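As a quick illustration of that point, here is a sketch (mine, reusing the blood-pressure data from the other answer, not part of that answer's session) showing that the squared coefficient t-statistic equals the ANOVA F-statistic:

. /* sketch: the coefficient t-test and the ANOVA F-test agree (t^2 = F) */
. sysuse bplong, clear
. keep if when=="After":when
. quietly regress bp i.sex
. display "t^2 = " (_b[1.sex]/_se[1.sex])^2   // 11.21 in the example above
. quietly anova bp sex
. display "F   = " e(F)                       // the same 11.21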

Significance testing is not primarily about the distribution of the estimate of some statistic (such an estimate can be used in testing, but it is derivative, not a first principle). Instead, it is first and foremost about the distribution of the data conditional on the null hypothesis being true. The sample distribution of the data is the same under both null hypotheses. Therefore, if an observation is significant for the one hypothesis, it is also significant for the other.

Significance means that you made an extreme observation given the null hypothesis.


Confidence intervals

Where the use of $\Delta$ and $\Delta\%$ may differ is in the expression of confidence intervals. In this case, we are no longer talking about the null hypothesis $H_0: \Delta = 0$ that assumes the parameter is equal to zero. Instead, we consider the range of parameters $\theta$ for which the hypothesis $H_0: \Delta = \theta$ passes a significance test (passing means no significance). The hypotheses $H_0: \Delta = \theta$ and $H_0: \Delta\% = \theta$, with $\theta \neq 0$, are not equivalent.
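A small sketch of this distinction, again reusing the blood-pressure regression from the other answer (my illustration; the values of $\theta$ are arbitrary):

. /* sketch: at theta = 0 the two nulls reject together, but for
.    theta != 0 they are different hypotheses */
. quietly regress bp i.sex
. test _b[1.sex] = 0                    // H0: Delta  = 0
. testnl _b[1.sex]/_b[_cons] = 0        // H0: Delta% = 0
. test _b[1.sex] = -5                   // H0: Delta  = -5 mmHg (about -3.2%)
. testnl _b[1.sex]/_b[_cons] = -0.05    // H0: Delta% = -5% (about -7.8 mmHg)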

Sextus Empiricus
  • Does the p-value invariance hold in a regression setting? I added an example above where the p-values do seem to vary slightly depending on whether relative or absolute change is used. Or am I missing something? – dimitriy Oct 15 '20 at 20:29
  • The p-values could definitely differ between the two because, as you said, they are different random variables with different sampling distributions under $H_0$. The point is that any test that rejects one $H_0$ rejects the other. One test may be more powerful, though. It's not that uncommon for multiple tests to test the same null hypothesis in different ways, with different test statistics and p-values, like Bartlett's test and Levene's test for equality of variances. – Noah Oct 15 '20 at 20:40
  • @DimitriyV.Masterov I agree with Noah: the p-values *can* differ for different tests. But when we consider $H_0:\Delta = 0$ or $H_0:\Delta\% = 0$ we are not necessarily considering different tests. The parameter estimates may be different statistics with different sample distributions, but they are not typically used to do the inference (in the same way as in linear regression: you could consider the sample distribution of a coefficient and perform a t-test, but for a better test you should consider an F-test, which considers the residuals/data and doesn't care how you express the coefficient). – Sextus Empiricus Oct 15 '20 at 21:03
  • @DimitriyV.Masterov I am not sure why your examples relate to different null hypotheses. You consider different variables, like the logarithm, but why does one relate more or less than the other to a hypothesis like $H_0:\Delta=0$ or $H_0:\Delta\%=0$? – Sextus Empiricus Oct 15 '20 at 21:11
  • Significance testing is not about the distribution of the estimate of some statistic (it can be used in testing, but it is derivative), but it is in the first place about the distribution of the data conditional on the null hypothesis being true. **The sample distribution is the same for both null hypotheses.** Therefore if an observation is significant for the one hypothesis, then it is also significant for the other. – Sextus Empiricus Oct 15 '20 at 21:16
  • The example at the bottom of my post does not involve logarithms. I have two types of tests for the absolute difference and the relative difference, with different p-values. – dimitriy Oct 15 '20 at 21:26
  • @DimitriyV.Masterov With those two Wald tests you are not testing different hypotheses; you are just using different statistics (the estimates of the coefficients $\hat\Delta$ vs. $\hat\Delta\%$ and estimates of their sample distributions) to perform the significance test. The tests are indeed different, but that does not mean that the related hypotheses are different. – Sextus Empiricus Oct 15 '20 at 21:44