Predicting customer churn - train & test sets

Question

I'm struggling with a problem where I'm trying to predict customer churn. I have monthly snapshot data going back several years, and tags for whether a customer left during a given month.

My main question is whether I should be using the entire dataset as my training set. For example, take the March 2014 end-of-month (EOM) snapshot, and train on whether a customer left or not. That March 2014 EOM snapshot includes the March EOM data for all current customers (or those that left in March), and the time-shifted data for any customers that left prior to March 2014. My thinking is that I CAN use the entire dataset, rather than reserving a test set, because effectively my test set can be the snapshot for April 2014 or May 2014. (Or August 2014, for that matter.)

I want to use the whole snapshot for training because there is a relatively low churn rate (0.02% in a given month). I've tried splitting off a Test set from the Train, and that usually shows good model performance on the Test set. But terrible performance on the subsequent months... (That's probably my real question, but I figured I'd start with getting the Train / Test thing settled.)

Just because you plan to use R does not make this a specific programming question. Ideally you should know what analysis is appropriate for your data and if you didn't know *how to implement* that in R, then that might be on-topic for this site. As written, this question seems like you are in need of statistical advice which is better handled at [stats.se]. — , Sep 25 '14 at 19:18
Close vote explanation: Please review the material available in the SO help files. This does not appear to be a coding question (since there is no code or data) but rather a request for statistical consultation (or perhaps for homework assistance). — DWin, Sep 25 '14 at 19:18
Migration is a good idea but the SO flagging process seems not to work very well. I often see my suggestions to migrate ignored by questioners and the mods seem to get annoyed by all the requests. — DWin, Sep 25 '14 at 19:40
Who deleted my comment stating this was a good on-topic question for [CrossValidated](http://stats.stackexchange.com) ? — smci, Sep 25 '14 at 22:39

smci · Answer 1 · 2020-04-06T01:32:33.987

It depends entirely on whether historical churn in 2013 is a good predictor of churn in mid-2014, i.e. whether the training set is predictive of test-set behavior.

In general you should assume no. [*]

(Obviously the individual customers who churn are different. But do they churn for different reasons: duration? cost? product usage? Did those customers come in through a trial, a social-media campaign, word of mouth? Was that different from how previous customers were acquired? Do you have the right features to capture those? Are you at least using random-forest classification instead of linear regression? Pay attention to feature selection: generate a ton of plausible candidate features, then use a legitimate feature-selection procedure, e.g. VIF (variance inflation factor, which flags collinear features).)

[*] Why assume no? You haven't said what the product domain is (music website? insurance? something else?), but things change over time: prices rise, product features get changed, competitors appear, etc. This is called concept drift. Maybe all the original customers came in on a 12-month subscription, then the renewal price rose.
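One quick way to look for this kind of change is to compare the churn rate from one snapshot month to the next. Here is a minimal sketch in plain Python, assuming each row is a `(snapshot_month, churned)` pair; the function name `churn_rate_by_month` and the toy data are made up for illustration and are not from the question.

```python
from collections import defaultdict

def churn_rate_by_month(rows):
    """rows: iterable of (snapshot_month, churned) pairs -> {month: churn rate}."""
    totals = defaultdict(int)
    churners = defaultdict(int)
    for month, churned in rows:
        totals[month] += 1
        churners[month] += int(churned)
    # "YYYY-MM" strings sort chronologically, so sorted() gives time order.
    return {m: churners[m] / totals[m] for m in sorted(totals)}

# Toy data (hypothetical): 4 customers in each of two monthly snapshots.
rows = [
    ("2014-03", False), ("2014-03", True), ("2014-03", False), ("2014-03", False),
    ("2014-04", True), ("2014-04", True), ("2014-04", False), ("2014-04", False),
]

rates = churn_rate_by_month(rows)
print(rates)  # {'2014-03': 0.25, '2014-04': 0.5}
```

If the rate (or the distribution of an important feature, computed the same way) shifts noticeably between the training months and the later months, a model fit on the earlier period probably won't carry over unchanged.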

> I've tried splitting off a Test set from the Train, and that usually shows good model performance on the Test set. But terrible performance on the subsequent months... (That's probably my real question, but I figured I'd start with getting the Train / Test thing settled.)

OK, that's useful, actionable information you should pay attention to. It's a good thing, not a bad thing. Start digging into it and tell us more: how exactly did you split off the test set? First n rows? By customer ID? By join date? Alphabetically by last name or username? By subscriber price? Randomly, with stratification? Chances are you naively chose a split criterion that introduced bias. Try different split criteria and show us what results you get. Add the details to your question above; the more information you give us, the more help we can be.
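To make the difference between split criteria concrete, here is a rough sketch contrasting a random row-level split with an out-of-time split (train on earlier snapshots, test on later ones). It assumes one row per customer per monthly snapshot; the `Row` layout, the function names, and the toy data are hypothetical, since the question doesn't show the actual schema.

```python
import random
from collections import namedtuple

# Hypothetical row layout: one row per customer per monthly snapshot.
Row = namedtuple("Row", ["customer_id", "snapshot_month", "churned"])

def random_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and split; the same customer and month can land on both sides."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def out_of_time_split(rows, cutoff_month):
    """Train on snapshots up to and including cutoff_month, test on later ones."""
    train = [r for r in rows if r.snapshot_month <= cutoff_month]
    test = [r for r in rows if r.snapshot_month > cutoff_month]
    return train, test

# Toy data: 4 customers x 6 monthly snapshots ("YYYY-MM" strings sort chronologically).
months = ["2014-01", "2014-02", "2014-03", "2014-04", "2014-05", "2014-06"]
rows = [Row(cid, m, False) for cid in range(4) for m in months]

train, test = out_of_time_split(rows, cutoff_month="2014-03")
print(len(train), len(test))  # 12 12
```

With the random split, rows for the same customer (and the same month) end up in both train and test, so the test score can look much better than performance on genuinely later months. The out-of-time split mimics what you're actually asking the model to do.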
