Caveats
Most importantly, a lot of things could have changed since you made the upgrade. For instance, if a major competitor went out of business and Christmas rolled around right about the time of the change, your comparison will conflate the positive effect of those factors with the site improvements. They could even mask that the new version is bad. A much better experimental design would have been to route a random fraction of your site traffic to the new version and then compare the outcomes for visitors who saw the new and the old site during the same period. That holds seasonality and competition constant, so you're not comparing apples to orangutans. Not a whole lot can be done to address this now (at least with the data you provided).
We are also going to assume that customers do not interact with each other (stockouts aren't possible, for instance) and that there are no repeat visits. We also assume that people can't skip steps in the funnel and that they can't enter the funnel anywhere but the beginning. These might be unrealistic assumptions for some web sites, but relaxing them complicates things.
Model and Data Transformations
If you are willing to ignore these concerns, you might try using a sequential logit model to analyze the flow through the funnel. This approach is sometimes called the sequential response model, continuation ratio logit, model for nested dichotomies, or the Mare model (after the UCLA sociologist). It has more names than an identity fraudster!
This approach requires data at the individual-customer level, which I can back out from the table you provided by differencing. This also handles the fact that traffic is higher on the new site. I will be using Stata and Maarten Buis' seqlogit command. Part of this can be handled with a series of plain logit commands* (which any decent statistical software can fit), but seqlogit allows you to do a very cool "decomposition" of the total effect of the version on the final outcome (like purchase probability or even a dollar amount).
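In plain terms, the model is just a separate logistic regression for each transition, fit only on the customers who are still in the funnel at that stage. With the single version dummy used below (and no other covariates), each transition is

$$\Pr(\text{pass transition } s \mid \text{reached stage } s) = \frac{\exp(\alpha_s + \beta_s \cdot \text{New})}{1 + \exp(\alpha_s + \beta_s \cdot \text{New})}, \qquad s = 1,\dots,5,$$

so $\exp(\alpha_s)$ is the old-site odds of continuing at stage $s$ and $\exp(\beta_s)$ is the new-site odds ratio, which is how the output below is reported.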
I've already changed your data from the cumulative, Napoleon's-Campaign-style funnel format to a final-outcome format by differencing adjacent cells within each column:
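For the record, here is a minimal sketch of that differencing step. It assumes your original table held cumulative counts of visitors reaching each stage, which, working backwards from the figures below, must have been 121, 32, 29, 28, 26, and 19 for the old version; the variable names are mine.

/* back out per-stage drop-outs from cumulative funnel counts (old version shown) */
clear
input int reached
121
 32
 29
 28
 26
 19
end
gen n = reached - reached[_n+1]    // visitors lost at each stage
replace n = reached if _n == _N    // last row: completed purchases
list, noobs                        // yields 89, 3, 1, 2, 7, 19 as used below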
. /* Set Up The Data */
. set more off
. clear
. input str10 finstat byte version int n
finstat version n
1. "Browsed" 1 89
2. "Clicked" 1 3
3. "Step3" 1 1
4. "Step4" 1 2
5. "Step5" 1 7
6. "Purchased" 1 19
7. "Browsed" 2 937
8. "Clicked" 2 21
9. "Step3" 2 5
10. "Step4" 2 15
11. "Step5" 2 26
12. "Purchased" 2 345
13. end
. sencode finstat, replace
. lab define version 1 "Old" 2 "New"
. lab val version version
. expand n
(1458 observations created)
. sort version finstat
. drop n
This gives us 1470 observations/rows (I could have also used frequency weights instead of expanding the data; a sketch of that follows the table below). For instance, for the Old version, we have 89 people who only browsed, 19 who completed a purchase, and 13 who dropped out somewhere in between.
. tab finstat version
| version
finstat | Old New | Total
-----------+----------------------+----------
Browsed | 89 937 | 1,026
Clicked | 3 21 | 24
Step3 | 1 5 | 6
Step4 | 2 15 | 17
Step5 | 7 26 | 33
Purchased | 19 345 | 364
-----------+----------------------+----------
Total | 121 1,349 | 1,470
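As promised above, the frequency-weight route would look roughly like this (an untested sketch: keep n instead of expanding, and pass it as a weight, which seqlogit should accept the way most estimation commands do; the tree() and levels() options are explained in the next section):

/* same model without expanding the data (sketch) */
seqlogit finstat i.version [fw=n], ///
    tree(1: 2 3 4 5 6, 2: 3 4 5 6, 3: 4 5 6, 4: 5 6, 5: 6) ///
    levels(1=0, 2=0, 3=0, 4=0, 5=0, 6=1) or robust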
Sequential Logit For Each Step In The Order Funnel
Now we fit the sequential logit of the final outcome on the binary version variable. The tree() option specifies the structure of the decision tree (the choices available at each of the $5$ junctures), and the levels() option specifies the revenue value of ending up at each stage. Here I assume you only care about the completed purchase. This assumption does not alter the estimation, but it will be used in the decomposition. The estimates are:
. /* Sequential Logit */
. seqlogit finstat i.version, ///
> tree(1: 2 3 4 5 6, 2: 3 4 5 6, 3: 4 5 6, 4: 5 6, 5: 6) ///
> levels(1=0,2=0,3=0,4=0,5=0,6=1) ///
> or robust
Transition tree:
Transition 1: 1 : 2 3 4 5 6
Transition 2: 2 : 3 4 5 6
Transition 3: 3 : 4 5 6
Transition 4: 4 : 5 6
Transition 5: 5 : 6
Computing starting values for:
Transition 1
Transition 2
Transition 3
Transition 4
Transition 5
Iteration 0: log pseudolikelihood = -1204.027
Iteration 1: log pseudolikelihood = -1204.027
Number of obs = 1470
Wald chi2(1) = 0.88
Log pseudolikelihood = -1204.027 Prob > chi2 = 0.3482
------------------------------------------------------------------------------
| Robust
finstat | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_2_3_4_5_6v1 |
version |
New | 1.222919 .2623202 0.94 0.348 .8031785 1.862015
_cons | .3595506 .0741362 -4.96 0.000 .2400212 .538605
-------------+----------------------------------------------------------------
_3_4_5_6v2 |
version |
New | 1.926108 1.245698 1.01 0.311 .5422151 6.842106
_cons | 9.666667 5.86462 3.74 0.000 2.943511 31.74591
-------------+----------------------------------------------------------------
_4_5_6v3 |
version |
New | 2.757144 3.069166 0.91 0.362 .3111206 24.43376
_cons | 27.99999 28.50525 3.27 0.001 3.807117 205.9299
-------------+----------------------------------------------------------------
_5_6v4 |
version |
New | 1.902564 1.483801 0.82 0.410 .412564 8.773792
_cons | 13 9.542633 3.49 0.000 3.08404 54.79825
-------------+----------------------------------------------------------------
_6v5 |
version |
New | 4.888664 2.379981 3.26 0.001 1.882756 12.69365
_cons | 2.714286 1.200506 2.26 0.024 1.140713 6.458546
------------------------------------------------------------------------------
For each step in the funnel, we have a logit model whose estimated coefficients have been transformed into odds ratios (the or option). For example, the first choice is whether to just browse (1) or continue further into the funnel (2-6). The baseline odds, given by the _cons coefficient and corresponding to the old version of the site, are $0.3595506$. This means that a lot of people drop out at this stage on the old site: roughly one third of a customer goes further for every person who merely window shops (or one for every three, if the concept of fractional people conjures troubling visions of amputees). This is equivalent to the raw-count calculation $(19+13)/89$: 19 purchasers plus 13 intermediate drop-outs, divided by the 89 who merely browsed.
The New coefficient gives you the multiplicative effect on the baseline odds. It's $1.222919$, so that's a $22\%$ increase in the odds of continuing. So for the new site, the odds should be $1.222919 \cdot 0.3595506=0.4397$, which matches the actual data, $(67+345)/937=0.4397$, since with no covariates the model simply reproduces the observed proportions. On the new site, almost half a user moves on past the just-browsing stage. Time to get your developers some pizza and a case of beer.
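You can verify this back-of-the-envelope arithmetic straight from the raw counts:

/* sanity check of the first transition against the raw counts */
display "old odds of continuing: " (19+13)/89                     // .3595506 = _cons
display "new odds of continuing: " (67+345)/937                   // .4397
display "odds ratio:             " ((67+345)/937) / ((19+13)/89)  // 1.2229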
However, the p-value is $0.348$ and the $95\%$ confidence interval is $[.8031785, 1.862015]$, which includes $1$. Containing 1 is unfortunate because multiplying by one leaves the baseline odds unchanged, and anything less than one would mean the new version is worse. So the apparent improvement over the old site is not statistically significant, at least for this stage of the order flow. In other words, it could be real or it could be luck. If we had more data (a larger sample for the old site in particular), we might have been able to tell the difference. You might also consider adding demographic and behavioral data about your customers (or even your segments) to this type of model as covariates. It may help with the significance.
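If you do have visitor-level information, adding it is just a matter of putting the extra variables on the right-hand side. A sketch only, since age and segment are made-up covariate names that are not in the data you posted:

/* sketch: age and segment are hypothetical visitor-level covariates */
seqlogit finstat i.version c.age i.segment, ///
    tree(1: 2 3 4 5 6, 2: 3 4 5 6, 3: 4 5 6, 4: 5 6, 5: 6) ///
    levels(1=0, 2=0, 3=0, 4=0, 5=0, 6=1) or robust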
If we look at the remaining decision stages, the multiplicative effects on the baseline odds are all greater than 1, but none of them is statistically significant except the last. That final stage (drop out at Step $5$ versus complete the purchase) has a whopping multiplicative effect of $4.888664$ with a p-value of $0.001$. Since this is the one stage that puts money in the bank, a significant increase by a factor of almost $5$ is terrific.
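Again, you can see exactly where these numbers come from in the raw counts:

/* the final transition, Step 5 versus Purchase, from the raw counts */
display "old odds of purchasing: " 19/7               // 2.714286 = _cons
display "new odds of purchasing: " 345/26             // 13.27
display "odds ratio:             " (345/26) / (19/7)  // 4.89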
Calculating The Total Effect
Now we move on to the decomposition:
. /* Decomposition */
. seqlogitdecomp version, table marg
At:
variable | value
-------------+---------
version | 1
Decomposition:
| _2_3_4_5_6v1 | _3_4_5_6v2 | _4_5_6v3 | _5_6v4 | _6v5
| b se | b se | b se | b se | b se
-------------+------------------+------------------+------------------+------------------+------------------
trans | | | | |
1b.version | 0 . | 0 . | 0 . | 0 . | 0 .
2.version | .0409 .042 | .0428 .0527 | .0217 .0344 | .0326 .0497 | .199 .088
-------------+------------------+------------------+------------------+------------------+------------------
weight | | | | |
weight | .594 .0868 | .173 .0351 | .163 .0338 | .169 .0345 | .215 .0373
at risk | 1 . | .264 .0401 | .24 .0388 | .231 .0383 | .215 .0373
gain | .594 .0868 | .655 .0883 | .679 .0883 | .731 .087 | 1 .
-------------+------------------+------------------+------------------+------------------+------------------
pr(pass) | | | | |
pr | .264 .0401 | .906 .0515 | .966 .0339 | .929 .0487 | .731 .087
-------------+------------------+------------------+------------------+------------------+------------------
tot | | | | |
1b.version | 0 . | | | |
2.version | .0987 .0329 | | | |
The trans rows give the effects in the standard additive probability metric (with the marg option). For me, this is easier to think about than the log-odds or odds-ratio scale. For example, for the first stage, the New-version effect is $0.0409$ with a standard error of $0.042$. This means a customer who encountered the new version is about $4.1$ percentage points more likely to move past browsing and into the rest of the funnel compared to the old site.
You can use the standard error to construct your own $95\%$ confidence interval, $[b\pm 1.96 \cdot se]=[-0.04142,0.12322]$, which unfortunately includes zero. You also have the effects for the other stages. The results are very similar to what we saw with the multiplicative OR coefficients: only the last transition shows a significant effect (its confidence interval does not include zero, since we're adding now rather than multiplying). That effect is an additional $19.9$ percentage points in the probability of completing the purchase among visitors who reach Step $5$ (from $0.731$ on the old site to $0.930$ on the new one).
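If you'd rather not do the interval arithmetic in your head, a one-liner per bound does it (using the first transition's numbers from the table):

/* hand-rolled 95% CI for the first-transition probability effect */
display "lower bound: " .0409 - invnormal(.975)*.042   // -.0414
display "upper bound: " .0409 + invnormal(.975)*.042   //  .1232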
The cool part is the very last row. It gives you the total effect on the probability of order completion given the effect of the version on all the intermediate stages. Amazingly, this is a weighted sum of the marginal effects on each transition, where the weights are the product of two elements: the proportion "at risk" at that transition, and the expected gain from passing it. The at-risk proportion is $1$ for the first transition, since everyone starts at the browsing stage, and it shrinks with each subsequent stage as people drop out. This makes sense: if you improve the last stage a great deal but hardly anyone makes it there to appreciate it, you're not going to get much of a lift.
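You can check that the weight row really is the at-risk row times the gain row:

/* weight = (proportion at risk) x (expected gain from passing) */
display "transition 1: " 1    * .594    // .594
display "transition 2: " .264 * .655    // .173
display "transition 3: " .24  * .679    // .163
display "transition 4: " .231 * .731    // .169
display "transition 5: " .215 * 1       // .215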
The gain line is a bit more confusing. We've assumed that you only care about the final step (the levels() option in the seqlogit command assigned zero value to all the others), so the gain for the last stage is $1$. For the first stage, we know that if someone doesn't drop out after browsing, his expected purchase probability is $0.906 \cdot 0.966 \cdot 0.929 \cdot 0.731 \cdot 1 = 0.594$. This is just the product of the probabilities of passing each of the remaining stages, which compound. Those probabilities are given by the pr(pass) line. For example, the first one is $(3+1+2+7+19)/121=0.264$.
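The rest of the pr(pass) numbers come from the old-site counts in exactly the same way:

/* pr(pass) for the old version, straight from the raw counts */
display "transition 1: " 32/121    // .264 (moved past browsing)
display "transition 2: " 29/32     // .906
display "transition 3: " 28/29     // .966
display "transition 4: " 26/28     // .929
display "transition 5: " 19/26     // .731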
The total effect is then
$$\frac{\Delta \mathbf{Pr} (P=1)}{\Delta v}=\sum_{s=1}^{5} r_s \cdot g_s \cdot m_s, $$
where $r_s$ is the fraction at risk at transition $s$, $g_s$ is the expected gain from passing transition $s$, and $m_s$ is the marginal effect of the version on transition $s$.
Now that we have all the pieces, we can do the calculation:
$$1\cdot.594\cdot.0409 + .264\cdot.655\cdot.0428 + .24\cdot.679\cdot.0217 + .231\cdot.731\cdot.0326+.215\cdot 1 \cdot .199=.0836$$
This doesn't quite match the total effect in the output, and it does not seem to be just rounding error. My best guess is that the weights and pass probabilities in the table are evaluated at the old version only (version = 1), so the weighted sum is a local approximation, whereas the tot row reports the exact change in the raw purchase probability between versions (checked below). I will report back if it turns out to be something else.
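A quick check of that reading against the raw counts (just arithmetic, no model involved):

/* exact change in the purchase rate between versions */
display "old purchase rate: " 19/121                // .1570
display "new purchase rate: " 345/1349              // .2558
display "difference:        " 345/1349 - 19/121     // .0987, the tot row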
The total treatment effect in the table is $0.0987$ with a standard error of $0.033$. This means the $95\%$ confidence interval is $[0.03402,0.16338]$. That's a $10$ percentage point improvement in order completion over the old version (definitely economically significant), which also looks statistically significant (it ain't just luck). This is the one number you can show to the CFO who doesn't care about all the technical and statistical details. You can read more about how to derive this on Maarten Buis' site while the pizza and beer are being delivered.
Another nice calculation you can do is the revenue effect. You can either change the levels() option for the final stage from $1$ to the (average) price of your product, or just do the multiplication by hand: if the price of your item is $\$100$ and dropping out at an earlier stage has no value to you, then the lift is worth an incremental $\$100 \cdot 0.0987 \approx \$9.87$ per customer who makes it to the product page (that is, who enters the funnel at all).
You can also specify the benefit to you of dropping out at an earlier stage. If reaching Step 5 is worth only \$4 and a purchase is worth \$100, the total effect works out to only about \$9.72, since the new version makes attrition at Step 5 less likely, and that attrition is now worth \$4 rather than nothing. This makes it easy to put a dollar value on an upgrade.
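In seqlogit terms, both of those calculations are just different levels() assignments. A rough sketch, mirroring the command above with the expanded data still in memory:

/* value a completed purchase at 100 dollars (sketch) */
seqlogit finstat i.version, ///
    tree(1: 2 3 4 5 6, 2: 3 4 5 6, 3: 4 5 6, 4: 5 6, 5: 6) ///
    levels(1=0, 2=0, 3=0, 4=0, 5=0, 6=100) or robust
seqlogitdecomp version, table marg

/* same, but give reaching Step 5 a consolation value of 4 dollars (sketch) */
seqlogit finstat i.version, ///
    tree(1: 2 3 4 5 6, 2: 3 4 5 6, 3: 4 5 6, 4: 5 6, 5: 6) ///
    levels(1=0, 2=0, 3=0, 4=0, 5=4, 6=100) or robust
seqlogitdecomp version, table marg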
*Here's how you replicate one transition of the sequential logit (stopping at Step 4 versus continuing to Step 5 or Purchase) with a lone logit:
. recode finstat 1/3=. 4=0 5=1 6=1, gen(x)
(1470 differences between finstat and x)
. logit x i.version, or robust nolog
Logistic regression Number of obs = 414
Wald chi2(1) = 0.68
Prob > chi2 = 0.4099
Log pseudolikelihood = -70.626428 Pseudo R2 = 0.0042
------------------------------------------------------------------------------
| Robust
x | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
version |
New | 1.902564 1.485092 0.82 0.410 .4120157 8.785467
_cons | 13 9.550934 3.49 0.000 3.080183 54.86688
------------------------------------------------------------------------------
. margins, dydx(version)
Conditional marginal effects Number of obs = 414
Model VCE : Robust
Expression : Pr(x), predict()
dy/dx w.r.t. : 2.version
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
version |
New | .0325685 .0497145 0.66 0.512 -.0648703 .1300072
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
Note how this matches the corresponding OR and decomposition coefficients, respectively. The really hard part of using a plain sequence of logits is getting the right standard error for the total effect. Perhaps you can estimate all the transition equations jointly (maybe by combining them with something like suest) and then calculate the weighted linear combination of their marginal effects as above.
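For completeness, here is a rough sketch of running all five transition logits in a loop. This reproduces the per-transition point estimates, but it still does not give you a joint covariance matrix for the weighted sum; it assumes finstat is coded 1 through 6 as created above.

/* one plain logit per transition: pass = went beyond stage s (sketch) */
forvalues s = 1/5 {
    quietly gen byte pass = finstat > `s' if finstat >= `s'
    logit pass i.version, or robust nolog
    margins, dydx(version)
    drop pass
}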
The Appendix of Alan Agresti's Analysis of Ordinal Categorical Data discusses various software implementations, though he calls it a continuation ratio logit.