Caveats
Most importantly, a lot of things could have changed since you made the upgrade. For instance, if a major competitor went out of business and Christmas rolled around right about the time of the change, your comparison will conflate the positive effect of those factors with the site improvements. They could even mask that the new version is bad. A much better experimental design would have been to route a random fraction of your site traffic to the new version and then compare the outcomes for visitors who saw the new and the old site during the same period. That holds seasonality and competition constant, so you're not comparing apples to orangutans. Not a whole lot can be done to address this now (at least with the data you provided).
We are also going to assume that customers do not interact with each other (stockouts aren't possible, for instance) and that there are no repeat visits. We also assume that people can't skip steps in the funnel and that they can't enter the funnel anywhere but the beginning. These might be unrealistic assumptions for some web sites, but relaxing them complicates things.
Model and Data Transformations
If you are willing to ignore these concerns, you might try using a sequential logit model to analyze the flow through the funnel. This approach is sometimes called the sequential response model, continuation ratio logit, model for nested dichotomies, or the Mare model (after the UCLA sociologist). It has more names than an identity fraudster!
This approach requires data at the individual-customer level, which I can back out from the table you provided by differencing. This also handles the fact that traffic is higher on the new site. I will be using Stata and Maarten Buis' seqlogit command. Part of this can be handled with a series of plain logit commands* (which any decent statistical software can fit), but seqlogit allows you to do a very cool "decomposition" of the total effect of the version on the final outcome (like purchase probability or even a dollar amount).
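In plain terms, the model is just a separate logistic regression for each transition, fit only on the customers who are still in the funnel at that stage. With the single version dummy used below (and no other covariates), each transition is

$$\Pr(\text{pass transition } s \mid \text{reached stage } s) = \frac{\exp(\alpha_s + \beta_s \cdot \text{New})}{1 + \exp(\alpha_s + \beta_s \cdot \text{New})}, \qquad s = 1,\dots,5,$$

so $\exp(\alpha_s)$ is the old-site odds of continuing at stage $s$ and $\exp(\beta_s)$ is the new-site odds ratio, which is how the output below is reported.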
I've already changed your data from the cumulative, Napoleon's-Campaign-style funnel format to a final-outcome format by differencing adjacent cells within each column:
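For the record, here is a minimal sketch of that differencing step. It assumes your original table held cumulative counts of visitors reaching each stage, which, working backwards from the figures below, must have been 121, 32, 29, 28, 26, and 19 for the old version; the variable names are mine.

/* back out per-stage drop-outs from cumulative funnel counts (old version shown) */
clear
input int reached
121
 32
 29
 28
 26
 19
end
gen n = reached - reached[_n+1]    // visitors lost at each stage
replace n = reached if _n == _N    // last row: completed purchases
list, noobs                        // yields 89, 3, 1, 2, 7, 19 as used below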
. /* Set Up The Data */
. set more off
. clear
. input str10 finstat byte version int n
finstat version n
1. "Browsed" 1 89
2. "Clicked" 1 3
3. "Step3" 1 1
4. "Step4" 1 2
5. "Step5" 1 7
6. "Purchased" 1 19
7. "Browsed" 2 937
8. "Clicked" 2 21
9. "Step3" 2 5
10. "Step4" 2 15
11. "Step5" 2 26
12. "Purchased" 2 345
13. end
. sencode finstat, replace
. lab define version 1 "Old" 2 "New"
. lab val version version
. expand n
(1458 observations created)
. sort version finstat
. drop n
This gives us 1470 observations/rows (I could have also used frequency weights instead of expanding the data; a sketch of that follows the table below). For instance, for the Old version, we have 89 people who only browsed, 19 who completed a purchase, and 13 who dropped out somewhere in between.
. tab finstat version
| version
finstat | Old New | Total
-----------+----------------------+----------
Browsed | 89 937 | 1,026
Clicked | 3 21 | 24
Step3 | 1 5 | 6
Step4 | 2 15 | 17
Step5 | 7 26 | 33
Purchased | 19 345 | 364
-----------+----------------------+----------
Total | 121 1,349 | 1,470
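As promised above, the frequency-weight route would look roughly like this (an untested sketch: keep n instead of expanding, and pass it as a weight, which seqlogit should accept the way most estimation commands do; the tree() and levels() options are explained in the next section):

/* same model without expanding the data (sketch) */
seqlogit finstat i.version [fw=n], ///
    tree(1: 2 3 4 5 6, 2: 3 4 5 6, 3: 4 5 6, 4: 5 6, 5: 6) ///
    levels(1=0, 2=0, 3=0, 4=0, 5=0, 6=1) or robust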
Sequential Logit For Each Step In The Order Funnel
Now we fit the sequential logit of the final outcome on the binary version variable. The tree() option specifies the structure of the decision tree (the choices available at each of the $5$ junctures), and the levels() option specifies the revenue value of ending up at each stage. Here I assume you only care about the completed purchase. This assumption does not alter the estimation, but it will be used in the decomposition. The estimates are:
. /* Sequential Logit */
. seqlogit finstat i.version, ///
> tree(1: 2 3 4 5 6, 2: 3 4 5 6, 3: 4 5 6, 4: 5 6, 5: 6) ///
> levels(1=0,2=0,3=0,4=0,5=0,6=1) ///
> or robust
Transition tree:
Transition 1: 1 : 2 3 4 5 6
Transition 2: 2 : 3 4 5 6
Transition 3: 3 : 4 5 6
Transition 4: 4 : 5 6
Transition 5: 5 : 6
Computing starting values for:
Transition 1
Transition 2
Transition 3
Transition 4
Transition 5
Iteration 0: log pseudolikelihood = -1204.027
Iteration 1: log pseudolikelihood = -1204.027
Number of obs = 1470
Wald chi2(1) = 0.88
Log pseudolikelihood = -1204.027 Prob > chi2 = 0.3482
------------------------------------------------------------------------------
| Robust
finstat | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_2_3_4_5_6v1 |
version |
New | 1.222919 .2623202 0.94 0.348 .8031785 1.862015
_cons | .3595506 .0741362 -4.96 0.000 .2400212 .538605
-------------+----------------------------------------------------------------
_3_4_5_6v2 |
version |
New | 1.926108 1.245698 1.01 0.311 .5422151 6.842106
_cons | 9.666667 5.86462 3.74 0.000 2.943511 31.74591
-------------+----------------------------------------------------------------
_4_5_6v3 |
version |
New | 2.757144 3.069166 0.91 0.362 .3111206 24.43376
_cons | 27.99999 28.50525 3.27 0.001 3.807117 205.9299
-------------+----------------------------------------------------------------
_5_6v4 |
version |
New | 1.902564 1.483801 0.82 0.410 .412564 8.773792
_cons | 13 9.542633 3.49 0.000 3.08404 54.79825
-------------+----------------------------------------------------------------
_6v5 |
version |
New | 4.888664 2.379981 3.26 0.001 1.882756 12.69365
_cons | 2.714286 1.200506 2.26 0.024 1.140713 6.458546
------------------------------------------------------------------------------
For each step in the funnel, we have a logit model whose estimated coefficients have been transformed into odds ratios (the or option). For example, the first choice is whether to just browse (1) or continue further into the funnel (2-6). The baseline odds, given by the _cons coefficient and corresponding to the old version of the site, are $0.3595506$. This means that a lot of people drop out at this stage on the old site: roughly one third of a customer goes further for every person who merely window shops (or one for every three, if the concept of fractional people conjures troubling visions of amputees). This is equivalent to the raw-count calculation $(19+13)/89$: 19 purchasers plus 13 intermediate drop-outs, divided by the 89 who merely browsed.
The New coefficient gives you the multiplicative effect on the baseline odds. It's $1.222919$, so that's a $22\%$ increase in the odds of continuing. So for the new site, the odds should be $1.222919 \cdot 0.3595506=0.4397$, which matches the actual data, $(67+345)/937=0.4397$, since with no covariates the model simply reproduces the observed proportions. On the new site, almost half a user moves on past the just-browsing stage. Time to get your developers some pizza and a case of beer.
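You can verify this back-of-the-envelope arithmetic straight from the raw counts:

/* sanity check of the first transition against the raw counts */
display "old odds of continuing: " (19+13)/89                     // .3595506 = _cons
display "new odds of continuing: " (67+345)/937                   // .4397
display "odds ratio:             " ((67+345)/937) / ((19+13)/89)  // 1.2229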
However, the p-value is $0.348$ and the $95\%$ confidence interval is $[.8031785, 1.862015]$, which includes $1$. Containing 1 is unfortunate because multiplying by one leaves the baseline odds unchanged, and anything less than one would mean the new version is worse. So the apparent improvement over the old site is not statistically significant, at least for this stage of the order flow. In other words, it could be real or it could be luck. If we had more data (a larger sample for the old site in particular), we might have been able to tell the difference. You might also consider adding demographic and behavioral data about your customers (or even your segments) to this type of model as covariates. It may help with the significance.
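If you do have visitor-level information, adding it is just a matter of putting the extra variables on the right-hand side. A sketch only, since age and segment are made-up covariate names that are not in the data you posted:

/* sketch: age and segment are hypothetical visitor-level covariates */
seqlogit finstat i.version c.age i.segment, ///
    tree(1: 2 3 4 5 6, 2: 3 4 5 6, 3: 4 5 6, 4: 5 6, 5: 6) ///
    levels(1=0, 2=0, 3=0, 4=0, 5=0, 6=1) or robust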
If we look at the remaining decision stages, the multiplicative effects on the baseline odds are all greater than 1, but none of them is statistically significant except the last. That final stage (drop out at Step $5$ versus complete the purchase) has a whopping multiplicative effect of $4.888664$ with a p-value of $0.001$. Since this is the one stage that puts money in the bank, a significant increase by a factor of almost $5$ is terrific.
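Again, you can see exactly where these numbers come from in the raw counts:

/* the final transition, Step 5 versus Purchase, from the raw counts */
display "old odds of purchasing: " 19/7               // 2.714286 = _cons
display "new odds of purchasing: " 345/26             // 13.27
display "odds ratio:             " (345/26) / (19/7)  // 4.89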
Calculating The Total Effect
Now we move on to the decomposition:
. /* Decomposition */
. seqlogitdecomp version, table marg
At:
variable | value
-------------+---------
version | 1
Decomposition:
| _2_3_4_5_6v1 | _3_4_5_6v2 | _4_5_6v3 | _5_6v4 | _6v5
| b se | b se | b se | b se | b se
-------------+------------------+------------------+------------------+------------------+------------------
trans | | | | |
1b.version | 0 . | 0 . | 0 . | 0 . | 0 .
2.version | .0409 .042 | .0428 .0527 | .0217 .0344 | .0326 .0497 | .199 .088
-------------+------------------+------------------+------------------+------------------+------------------
weight | | | | |
weight | .594 .0868 | .173 .0351 | .163 .0338 | .169 .0345 | .215 .0373
at risk | 1 . | .264 .0401 | .24 .0388 | .231 .0383 | .215 .0373
gain | .594 .0868 | .655 .0883 | .679 .0883 | .731 .087 | 1 .
-------------+------------------+------------------+------------------+------------------+------------------
pr(pass) | | | | |
pr | .264 .0401 | .906 .0515 | .966 .0339 | .929 .0487 | .731 .087
-------------+------------------+------------------+------------------+------------------+------------------
tot | | | | |
1b.version | 0 . | | | |
2.version | .0987 .0329 | | | |
The trans rows give the effects in the standard additive probability metric (with the marg option). For me, this is easier to think about than the log-odds or odds-ratio scale. For example, for the first stage, the New-version effect is $0.0409$ with a standard error of $0.042$. This means a customer who encountered the new version is about $4.1$ percentage points more likely to move past browsing and into the rest of the funnel compared to the old site.
You can use the standard error to construct your own $95\%$ confidence interval, $[b\pm 1.96 \cdot se]=[-0.04142,0.12322]$, which unfortunately includes zero. You also have the effects for the other stages. The results are very similar to what we saw with the multiplicative OR coefficients: only the last transition shows a significant effect (its confidence interval does not include zero, since we're adding now rather than multiplying). That effect is an additional $19.9$ percentage points in the probability of completing the purchase among visitors who reach Step $5$ (from $0.731$ on the old site to $0.930$ on the new one).
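If you'd rather not do the interval arithmetic in your head, a one-liner per bound does it (using the first transition's numbers from the table):

/* hand-rolled 95% CI for the first-transition probability effect */
display "lower bound: " .0409 - invnormal(.975)*.042   // -.0414
display "upper bound: " .0409 + invnormal(.975)*.042   //  .1232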
The cool part is the very last row. It gives you the total effect on the probability of order completion given the effect of the version on all the intermediate stages. Amazingly, this is a weighted sum of the marginal effects on each transition, where the weights are the product of two elements: the proportion "at risk" at that transition, and the expected gain from passing it. The at-risk proportion is $1$ for the first transition, since everyone starts at the browsing stage, and it shrinks with each subsequent stage as people drop out. This makes sense: if you improve the last stage a great deal but hardly anyone makes it there to appreciate it, you're not going to get much of a lift.
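You can check that the weight row really is the at-risk row times the gain row:

/* weight = (proportion at risk) x (expected gain from passing) */
display "transition 1: " 1    * .594    // .594
display "transition 2: " .264 * .655    // .173
display "transition 3: " .24  * .679    // .163
display "transition 4: " .231 * .731    // .169
display "transition 5: " .215 * 1       // .215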
The gain line is a bit more confusing. We've assumed that you only care about the final step (the levels() option in the seqlogit command assigned zero value to all the others), so the gain for the last stage is $1$. For the first stage, we know that if someone doesn't drop out after browsing, his expected purchase probability is $0.906 \cdot 0.966 \cdot 0.929 \cdot 0.731 \cdot 1 = 0.594$. This is just the product of the probabilities of passing each of the remaining stages, which compound. Those probabilities are given by the pr(pass) line. For example, the first one is $(3+1+2+7+19)/121=0.264$.
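The rest of the pr(pass) numbers come from the old-site counts in exactly the same way:

/* pr(pass) for the old version, straight from the raw counts */
display "transition 1: " 32/121    // .264 (moved past browsing)
display "transition 2: " 29/32     // .906
display "transition 3: " 28/29     // .966
display "transition 4: " 26/28     // .929
display "transition 5: " 19/26     // .731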
The total effect is then
$$\frac{\Delta \mathbf{Pr} (P=1)}{\Delta v}=\sum_{s=1}^{5} r_s \cdot g_s \cdot m_s, $$
where $r_s$ is the fraction at risk at transition $s$, $g_s$ is the expected gain from passing transition $s$, and $m_s$ is the marginal effect of the version on transition $s$.
Now that we have all the pieces, we can do the calculation:
$$1\cdot.594\cdot.0409 + .264\cdot.655\cdot.0428 + .24\cdot.679\cdot.0217 + .231\cdot.731\cdot.0326+.215\cdot 1 \cdot .199=.0836$$
This doesn't quite match the total effect in the output, and it does not seem to be just rounding error. My best guess is that the weights and pass probabilities in the table are evaluated at the old version only (version = 1), so the weighted sum is a local approximation, whereas the tot row reports the exact change in the raw purchase probability between versions (checked below). I will report back if it turns out to be something else.
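A quick check of that reading against the raw counts (just arithmetic, no model involved):

/* exact change in the purchase rate between versions */
display "old purchase rate: " 19/121                // .1570
display "new purchase rate: " 345/1349              // .2558
display "difference:        " 345/1349 - 19/121     // .0987, the tot row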
The total treatment effect in the table is $0.0987$ with a standard error of $0.033$. This means the $95\%$ confidence interval is $[0.03402,0.16338]$. That's a $10$ percentage point improvement in order completion over the old version (definitely economically significant), which also looks statistically significant (it ain't just luck). This is the one number you can show to the CFO who doesn't care about all the technical and statistical details. You can read more about how to derive this on Maarten Buis' site while the pizza and beer are being delivered.
Another nice calculation you can do is the revenue effect. You can either change the levels() option for the final stage from $1$ to the (average) price of your product, or just do the multiplication by hand: if the price of your item is $\$100$ and dropping out at an earlier stage has no value to you, then the lift is worth an incremental $\$100 \cdot 0.0987 \approx \$9.87$ per customer who makes it to the product page (that is, who enters the funnel at all).
You can also specify the benefit to you of dropping out at an earlier stage. If reaching Step 5 is worth only \$4 and a purchase is worth \$100, the total effect works out to only about \$9.72, since the new version makes attrition at Step 5 less likely, and that attrition is now worth \$4 rather than nothing. This makes it easy to put a dollar value on an upgrade.
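In seqlogit terms, both of those calculations are just different levels() assignments. A rough sketch, mirroring the command above with the expanded data still in memory:

/* value a completed purchase at 100 dollars (sketch) */
seqlogit finstat i.version, ///
    tree(1: 2 3 4 5 6, 2: 3 4 5 6, 3: 4 5 6, 4: 5 6, 5: 6) ///
    levels(1=0, 2=0, 3=0, 4=0, 5=0, 6=100) or robust
seqlogitdecomp version, table marg

/* same, but give reaching Step 5 a consolation value of 4 dollars (sketch) */
seqlogit finstat i.version, ///
    tree(1: 2 3 4 5 6, 2: 3 4 5 6, 3: 4 5 6, 4: 5 6, 5: 6) ///
    levels(1=0, 2=0, 3=0, 4=0, 5=4, 6=100) or robust
seqlogitdecomp version, table marg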
*Here's how you replicate one transition of the sequential logit (stopping at Step 4 versus continuing to Step 5 or Purchase) with a lone logit:
. recode finstat 1/3=. 4=0 5=1 6=1, gen(x)
(1470 differences between finstat and x)
. logit x i.version, or robust nolog
Logistic regression Number of obs = 414
Wald chi2(1) = 0.68
Prob > chi2 = 0.4099
Log pseudolikelihood = -70.626428 Pseudo R2 = 0.0042
------------------------------------------------------------------------------
| Robust
x | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
version |
New | 1.902564 1.485092 0.82 0.410 .4120157 8.785467
_cons | 13 9.550934 3.49 0.000 3.080183 54.86688
------------------------------------------------------------------------------
. margins, dydx(version)
Conditional marginal effects Number of obs = 414
Model VCE : Robust
Expression : Pr(x), predict()
dy/dx w.r.t. : 2.version
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
version |
New | .0325685 .0497145 0.66 0.512 -.0648703 .1300072
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
Note how this matches the corresponding OR and decomposition coefficients, respectively. The really hard part of using a plain sequence of logits is getting the right standard error for the total effect. Perhaps you can estimate all the transition equations jointly (maybe by combining them with something like suest) and then calculate the weighted linear combination of their marginal effects as above.
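For completeness, here is a rough sketch of running all five transition logits in a loop. This reproduces the per-transition point estimates, but it still does not give you a joint covariance matrix for the weighted sum; it assumes finstat is coded 1 through 6 as created above.

/* one plain logit per transition: pass = went beyond stage s (sketch) */
forvalues s = 1/5 {
    quietly gen byte pass = finstat > `s' if finstat >= `s'
    logit pass i.version, or robust nolog
    margins, dydx(version)
    drop pass
}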
The Appendix of Alan Agresti's Analysis of Ordinal Categorical Data discusses various software implementations, though he calls it a continuation ratio logit.