
I'm running an AB Test and using Survival Analysis to estimate the Return Rate to the website.

Each day (in a total of 7 days) I randomly assign 100 users to group A and 100 users to group B and after 7 days I stop the test.

Each day I also check the number of tracked users (only the ones participating in the experiment) who returned to the website.

At the end of the experiment, there will be:

  • 7 cohorts that passed through day 0
  • 6 cohorts through day 1
  • 5 cohorts through day 2
  • ...
  • 1 cohort through day 6

By cohort I mean the following: on the first day of the experiment I assign 100 users to each group, and that is my first cohort. On the second day I assign another 100 users to each group, and these users are my second cohort, and so on.

As an example, the Return Rate of group A is simply the number of users of group A who returned to the website on day $i$ divided by the number of users that reached that day.

$$ RR_i = \frac{n_i}{N_i}$$

  • $RR_i$ is the Return Rate on day $i$
  • $N_i$ is the number of users that reached day $i$ (if $i = 2$, the number of cohorts that reached that day is 5, if each cohort has 100 users, then $N_2 = 500$)
  • $n_i$ is the number of users that returned on that day.
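The definition above can be sketched in a few lines of Python. The cohort size and the 7-day window come from the question; the per-day return counts are made-up numbers used only for illustration, and the function names are hypothetical.

```python
COHORT_SIZE = 100  # users assigned to one group per day (from the question)
N_DAYS = 7         # length of the experiment in days (from the question)


def users_at_risk(day, cohort_size=COHORT_SIZE, n_days=N_DAYS):
    """N_i: users whose cohort has been observed through day i."""
    cohorts_reaching_day = n_days - day  # 7 cohorts for day 0, ..., 1 for day 6
    return cohort_size * cohorts_reaching_day


def return_rate(returned, day):
    """RR_i = n_i / N_i for one group."""
    return returned / users_at_risk(day)


# Hypothetical n_i for group A, days 0..6 (illustrative only, not real data).
returned_a = [210, 150, 110, 80, 57, 36, 17]

for day, n_i in enumerate(returned_a):
    print(day, users_at_risk(day), round(return_rate(n_i, day), 3))
```

Note that $N_i$ shrinks as $i$ grows, so the later days rest on far fewer users and their rates are noisier.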

When using Survival Analysis, we assume that once an individual dies, it never comes back. But in my analysis, a user who comes to the website on day 0 can return on days 1 and 2, or on days 1, 3 and 6, or even on every one of the following days.

In that case, can I still use Survival Analysis?

  1. If so, what else do I need to consider in order to still make it valid?
  2. If not, what better approaches can I use?
kjetil b halvorsen
Thiago
  • Perhaps providing a quantitative definition of "Return Rate" might help clear things up. If you had a perfect crystal ball that could show you all details of your website usage throughout the future, so that there are no concerns about sampling or uncertainty, exactly how would you compute the Return Rate? – whuber Jun 02 '15 at 19:55
  • I would like to suggest that it can be helpful to distinguish terms of importance to survival analysis from terms of importance to *you.* In order not to get stuck in potentially irrelevant ideas, could we forget about your experiment and survival analysis for a moment? What exactly do you want to learn about your website users? Would it be something like, say, a function $f$ for which $f(i)$ tells you what proportion returns $i$ days after first encountering the site? – whuber Jun 02 '15 at 20:24
  • Are you trying to predict how many will return? If so, survival analysis is not the right approach. You might want to assume some type of probability distribution for the following: some will return, some will never return, and there are differences (heterogeneity) among the people coming to your website. You could then model this as a mixture model. I read an article about such a model in the European Journal of Operational Research or Computers and Operations Research. If I find it, I'll share it. – forecaster Jun 02 '15 at 20:29
  • The objective is to compare the Return Rate of the two groups and decide which one is better day by day. The output would then be given as the ratio $\frac{RR_{B,i}}{RR_{A,i}} - 1$, where A is my control group and B the experiment group. I would also want to compute the $SE$ of that ratio. – Thiago Jun 02 '15 at 20:37
  • Great @forecaster! It would be great to read that article! Btw, thinking about this problem some days ago, I came up with a (probably wrong) solution to this question, because I wasn't considering the time. If it's of interest, here is the question's link. – Thiago Jun 02 '15 at 21:08
  • Given that your time is discrete, you can just use logistic regression, and then add whatever input variables you wish to use in the model, e.g. the number of previous return days. – seanv507 Jul 03 '19 at 07:15
  • I.e., you predict the probability of returning the next day given.... – seanv507 Jul 03 '19 at 07:16
  • There is a particular kind of survival analysis called `multi-state` analysis that might be appropriate here, if censoring is the important factor. An R tutorial on the matter is https://cran.r-project.org/web/packages/survival/vignettes/compete.pdf – Wayne Aug 02 '20 at 14:01
  • On the first day (I assume day 0) you have only one single cohort. What do you mean by *'7 cohorts passes through day 0'*? – Sextus Empiricus Sep 20 '21 at 07:52
  • What does return rate on day $i$ mean? You count the total people in the cohorts and the fraction from those that return that particular day. But what is special about that day number $i$? The cohorts are not the same age are they (because the cohorts are not generated on the same days)? So the day number $i$ does not refer to age of the cohort. – Sextus Empiricus Sep 20 '21 at 07:55
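The day-by-day comparison described in the comments, $\frac{RR_{B,i}}{RR_{A,i}} - 1$ with a standard error, could be sketched as follows. This is only one possible approach, not something proposed in the thread: it uses the delta method on the log ratio and assumes that users return independently with a common probability within each group (a binomial model). The counts are made up for illustration and `relative_lift` is a hypothetical name.

```python
import math


def relative_lift(n_a, N_a, n_b, N_b):
    """Return (RR_B / RR_A - 1, approximate SE) for one day.

    Delta-method approximation under an assumed binomial model:
    Var(log RR) ~ (1 - p) / (N * p) = (1 - p) / n for each group.
    """
    p_a = n_a / N_a
    p_b = n_b / N_b
    ratio = p_b / p_a
    se_log_ratio = math.sqrt((1 - p_a) / n_a + (1 - p_b) / n_b)
    se_lift = ratio * se_log_ratio  # SE(ratio) ~ ratio * SE(log ratio)
    return ratio - 1, se_lift


# Hypothetical day-2 counts: N_2 = 500 per group (5 cohorts of 100 users).
lift_day2, se_day2 = relative_lift(n_a=110, N_a=500, n_b=125, N_b=500)
print(f"lift = {lift_day2:.3f}, SE = {se_day2:.3f}")
```

Because the same users can return on several days, the per-day estimates are correlated with one another, so they should not be combined as if they were independent.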

1 Answer


If you modeled this with Poisson regression, you could still incorporate the time aspect without violating the assumptions of the model, since the outcome can accommodate multiple hits per user. You would be able to model event rates (number of events per person at risk per unit time). Poisson regression is the "other proportional hazards model", the more widely known one being Cox regression. In R you will need to include a strata() term if you have multiple data lines with the same subject ID. (I'm sure other full-featured stats programs have similar capabilities.)
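As a rough illustration of the event-rate idea, here is a minimal pure-Python sketch of a two-group Poisson regression with a log person-days offset, fitted by Newton-Raphson. The person-days at risk follow from the design in the question (100 users per cohort, cohorts observed for 7, 6, ..., 1 days); the event totals are hypothetical, and `fit_two_group_poisson` is an illustrative name, not a library function. In practice you would use the GLM routine of a statistics package rather than hand-rolled code.

```python
import math

# Person-days at risk per group, from the design in the question.
COHORT_SIZE = 100
person_days = COHORT_SIZE * sum(range(1, 8))  # 100 * (1 + 2 + ... + 7)

# Hypothetical total return visits per group (illustrative only).
events = [420, 480]               # [group A, group B]
exposure = [person_days] * 2


def fit_two_group_poisson(events, exposure, max_iter=50, tol=1e-12):
    """Fit log(mu_g) = log(exposure_g) + b0 + b1 * x_g by Newton-Raphson,
    with x_g = 0 for group A and x_g = 1 for group B."""
    x = [0.0, 1.0]
    b0, b1 = math.log(sum(events) / sum(exposure)), 0.0  # start at pooled rate

    def fitted(b0, b1):
        return [exposure[g] * math.exp(b0 + b1 * x[g]) for g in range(2)]

    for _ in range(max_iter):
        mu = fitted(b0, b1)
        # Score vector X'(y - mu) and information matrix X' diag(mu) X.
        s0 = sum(events[g] - mu[g] for g in range(2))
        s1 = sum(x[g] * (events[g] - mu[g]) for g in range(2))
        i00 = sum(mu)
        i01 = sum(x[g] * mu[g] for g in range(2))
        i11 = sum(x[g] ** 2 * mu[g] for g in range(2))
        det = i00 * i11 - i01 ** 2
        step0 = (i11 * s0 - i01 * s1) / det
        step1 = (i00 * s1 - i01 * s0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break

    mu = fitted(b0, b1)
    # For this 0/1 design the inverse information gives Var(b1) = 1/mu_A + 1/mu_B.
    se_b1 = math.sqrt(1.0 / mu[0] + 1.0 / mu[1])
    return b0, b1, se_b1


b0, b1, se_b1 = fit_two_group_poisson(events, exposure)
rate_ratio = math.exp(b1)  # returns per person-day, B relative to A
ci_low = math.exp(b1 - 1.96 * se_b1)
ci_high = math.exp(b1 + 1.96 * se_b1)
print(f"rate ratio = {rate_ratio:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

With a single binary covariate the fit reduces to the ratio of the two observed rates, which makes the sketch easy to check by hand. The model-based standard error treats all events as independent; repeat visits by the same user are likely correlated, so it may be too small.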

The UCLA site has a series of worked examples at: https://stats.idre.ucla.edu/r/dae/poisson-regression/

Achim Zeileis, Christian Kleiber, and Simon Jackman have a nice tutorial at https://www.jstatsoft.org/article/view/v027i08

The source I learned it from was the second volume of Breslow and Day's two-volume monograph, "Statistical methods in cancer research. Volume II--The design and analysis of cohort studies". The IARC kindly makes it freely available at: https://publications.iarc.fr/_publications/media/download/3494/fb469ed43c52f0c738915cca6a0f31544b9ed7b6.pdf Unfortunately, the code is not in a modern stats language; it is in GLIM. I found learning R rather easy after coming to the task from GLIM, so if you are an R user, the code in that source may still be useful. Certainly its discussion of validity concerns in cohort analysis is still relevant.

DWin