I have half a year of purchase data from an e-commerce site (100K purchases by 60K customers). In a simple A/B testing setup, randomly chosen customers got a discount on their next purchase after completing an order (`disc` = 0/1 below). I want to estimate to what extent `disc` influenced the interval between orders (`diff`).
Following the KISS principle, I simply drop all one-time customers and regress ln(`diff`) on `disc`, but I don't observe any effect at all. I have two obvious problems:
1. The data are heavily censored: 60% of customers appeared only once.
2. Selection bias: frequent buyers had more chances to get a discount.
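To make the censoring issue concrete, here is a minimal sketch of how the interval data could be laid out so that one-time customers are kept as censored observations instead of being dropped. The purchase log is made up, and the names `purchases`, `day`, `study_end`, and `gaps` are hypothetical; only `customer_id`, `disc`, `diff`, and `event` correspond to what I use in the question.

```r
# Toy purchase log (hypothetical data): one row per completed order.
purchases <- data.frame(
  customer_id = c(1, 1, 1, 2, 3, 3),
  day         = c(5, 40, 90, 20, 10, 150),  # day of the order within the study window
  disc        = c(1, 0, 1, 0, 1, 0)         # 1 = discount offered after this order
)
study_end <- 180  # assumed last day of observation

# Sort by customer and time so that the "next order" is simply the next row.
purchases <- purchases[order(purchases$customer_id, purchases$day), ]

# Day of the following order by the same customer, NA if there is none.
next_day <- ave(purchases$day, purchases$customer_id,
                FUN = function(d) c(d[-1], NA))

# One row per interval: observed gaps have event = 1, while the open interval
# after each customer's last order is censored at study_end (event = 0).
gaps <- data.frame(
  customer_id = purchases$customer_id,
  disc        = purchases$disc,
  event       = as.integer(!is.na(next_day)),
  diff        = ifelse(is.na(next_day),
                       study_end - purchases$day,
                       next_day - purchases$day)
)
print(gaps)
```

In this layout customer 2, who ordered only once, contributes a single censored interval, which is exactly the kind of row the ln(`diff`) regression throws away.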
To address (1), I turned to a Cox model, `coxph(Surv(diff, event) ~ disc + cluster(customer_id))`, and I do observe an effect! However, I can't figure out whether this is the best way to handle multiple failure times (purchases) per customer. For (2), I'm thinking of introducing a lagged `diff` as a covariate, but I don't know how to do that in the censored case.
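For reference, this is roughly how I fit the Cox model, shown here as a self-contained sketch on simulated data rather than my real data. The simulation settings (sample size, hazard rates, censoring window) and the names `gaps`, `true_gap`, `cens`, and `fit` are assumptions made only for illustration; the `coxph` call itself is the one quoted above.

```r
library(survival)

set.seed(1)
n <- 1000                                        # number of intervals (assumed)
customer_id <- sample(1:300, n, replace = TRUE)  # several intervals per customer
disc <- rbinom(n, 1, 0.5)                        # randomized discount flag

# Simulated gap until the next order: a discount raises the purchase hazard.
true_gap <- rexp(n, rate = 0.02 * exp(0.7 * disc))
cens     <- runif(n, 10, 120)                    # time left until the end of the study

gaps <- data.frame(
  customer_id = customer_id,
  disc        = disc,
  diff        = pmin(true_gap, cens),            # observed (possibly censored) gap
  event       = as.integer(true_gap <= cens)     # 1 = next purchase observed
)

# cluster() keeps the usual Cox estimate of the coefficient but replaces the
# standard error with a robust one that allows for repeated intervals
# from the same customer.
fit <- coxph(Surv(diff, event) ~ disc + cluster(customer_id), data = gaps)
print(summary(fit))
```

A positive coefficient on `disc` means a higher hazard of the next purchase, i.e. shorter intervals between orders.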
There have been a number of relevant discussions (RFM & customer lifetime value modeling in R; Survival Analysis with Multiple Events), but I failed to find a solution to my problems there. There is also the BTYD package, but it's not parametric. I guess this is a very standard question, but I can't find a step-by-step (CrossValidated) guide.