Predicting churn in non-contractual setting - correlation problem

Question

I'm trying to predict customer churn in a non-contractual setting, which means we cannot observe exactly when a customer churns.

We have therefore created our Y variable (churn) with the rule: if days_since_last_purchase > 365 days, then 1 (churned); otherwise 0.

My question is therefore: can I still use days_since_last_purchase as an explanatory (X) variable, even though it is used to create the Y variable? When days_since_last_purchase is included, the model performs better and the variable ranks first in variable importance.
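To make the concern concrete, here is a minimal sketch of the labelling rule described above. It assumes that the feature and the label are both measured at the same snapshot date, and the recency values are made up for illustration. Under that assumption, a single threshold on days_since_last_purchase reproduces the label exactly.

```python
# Minimal sketch: the label is a deterministic function of the feature
# when both are computed at the same snapshot date (an assumption here).
CHURN_THRESHOLD_DAYS = 365  # threshold taken from the rule in the question


def churn_label(days_since_last_purchase: int) -> int:
    """Return 1 (churned) if recency exceeds the threshold, else 0."""
    return 1 if days_since_last_purchase > CHURN_THRESHOLD_DAYS else 0


# Hypothetical recency values, in days, for five customers.
recency = [12, 200, 365, 366, 900]
labels = [churn_label(d) for d in recency]

# A one-split "model" on the same feature recovers every label.
predictions = [int(d > CHURN_THRESHOLD_DAYS) for d in recency]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

If the real pipeline measures the feature and the label over different time windows, this exact relationship no longer holds.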

The models we are using are logistic regression, random forest, and XGBoost.

Thanks for your help

Do you end up with a perfect model? If days_since_last_purchase > 365 you predict Y=1 and otherwise 0, and it always turns out to give correct predictions. Random forest would love this, while logistic regression might break. — Henry, May 13 '21 at 17:04
What time periods are your X and Y variables covering (both during training and prediction respectively)? And when you say "the model performs better", what does that refer to exactly? — B.Liu, May 14 '21 at 08:47
I do not end up with a perfect model. The models have an accuracy of 0.92 (from the confusion matrix) with the variable and 0.82 without it. So when I use random forest and XGBoost, should I not be concerned about this, because they will penalize the correlation? We have historical transaction data going back 5 years. — Søren Therkildsen, May 14 '21 at 08:51
Sorry, I should have been clearer with my question: say you train the model today (or at time $t$); what transaction data are you including when you build your explanatory and target variables during training and prediction respectively? I know some use $t$-5years to $t$-1year for the explanatory variables and $t$-1year to $t$ for the target variable during training, and then shift the two time periods forward a year to do live predictions. I ask because what you propose may or may not make sense depending on how you split the data temporally. — B.Liu, May 14 '21 at 09:06
Further clarification question: When you say an accuracy of 0.92, can I assume that is on some validation set? And is your churn/non churn class balanced or unbalanced? Accuracy can be a pretty bad metric for unbalanced datasets. — B.Liu, May 14 '21 at 09:08
A ['Buy Till You Die' model](https://en.m.wikipedia.org/wiki/Buy_Till_you_Die) might be worth considering. — Scortchi - Reinstate Monica, May 14 '21 at 09:19
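The temporal split raised in the comments (features from data up to a cutoff, target from the following year) can be sketched as below. This is only an illustration under stated assumptions: the `build_rows` helper, the `(customer_id, purchase_date)` record layout, and the sample transactions are all hypothetical and not taken from the asker's data.

```python
from datetime import date, timedelta


def build_rows(transactions, cutoff, horizon_days=365):
    """Build {customer_id: (recency_days, churned)} with a temporal split.

    Features use only purchases on or before `cutoff`; the label uses only
    purchases in the `horizon_days` after `cutoff`. Hypothetical helper.
    """
    window_end = cutoff + timedelta(days=horizon_days)
    last_before = {}      # latest purchase date per customer up to cutoff
    active_after = set()  # customers who bought inside the target window
    for customer_id, purchase_date in transactions:
        if purchase_date <= cutoff:
            prev = last_before.get(customer_id)
            if prev is None or purchase_date > prev:
                last_before[customer_id] = purchase_date
        elif purchase_date <= window_end:
            active_after.add(customer_id)

    rows = {}
    for customer_id, last in last_before.items():
        recency = (cutoff - last).days           # feature, as of cutoff
        churned = 0 if customer_id in active_after else 1  # target
        rows[customer_id] = (recency, churned)
    return rows


# Made-up transactions for three customers.
transactions = [
    ("a", date(2019, 3, 1)),
    ("a", date(2020, 9, 15)),
    ("b", date(2019, 11, 20)),
    ("c", date(2020, 4, 2)),
    ("c", date(2020, 5, 30)),
]
rows = build_rows(transactions, cutoff=date(2020, 5, 13))
```

With this made-up data, customer "a" has a recency of 439 days at the cutoff but still purchases within the target window, so the label is 0. Built this way, days_since_last_purchase is an informative predictor rather than a restatement of the target.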
