I'm trying to predict customer churn in a non-contractual setting, which means we cannot observe exactly when a customer churns.

We have therefore created our Y variable (churn) as follows: if days_since_last_purchase > 365 days then 1 (churned), otherwise 0.
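
The labelling rule above can be sketched roughly as follows. This is only an illustration: the transaction layout, the customer IDs, the snapshot date and the helper names `days_since_last_purchase` and `churn_label` are assumptions, not taken from the actual data.

```python
from datetime import date

CHURN_THRESHOLD_DAYS = 365  # threshold from the rule above


def days_since_last_purchase(purchase_dates, snapshot):
    """Days between the most recent purchase and the snapshot date."""
    return (snapshot - max(purchase_dates)).days


def churn_label(days_since):
    """Return 1 (churned) if the gap exceeds the threshold, otherwise 0."""
    return 1 if days_since > CHURN_THRESHOLD_DAYS else 0


# Hypothetical transactions per customer, for illustration only.
transactions = {
    "A": [date(2019, 3, 1), date(2019, 11, 20)],
    "B": [date(2020, 9, 5), date(2021, 4, 30)],
}
snapshot = date(2021, 5, 14)  # assumed date at which the label is computed

labels = {}
for customer_id, dates in transactions.items():
    gap = days_since_last_purchase(dates, snapshot)
    labels[customer_id] = churn_label(gap)
```

Note that if the feature and the label are both computed from the same snapshot date, as in this sketch, the rule `days_since_last_purchase > 365` reproduces the label by construction.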

My question is therefore: can I still use days_since_last_purchase as an explanatory (X) variable, even though it is used to create the Y variable? When days_since_last_purchase is included, the models perform better and it ranks number 1 in variable importance.

The models we are using are logistic regression, random forest and XGBoost.

Thanks for your help



  • Do you end up with a perfect model? If days_since_last_purchase > 365 you predict Y=1 and otherwise 0, and it always turns out to give correct predictions. Random forest would love this, while logistic regression might break. – Henry May 13 '21 at 17:04
  • What time periods are your X and Y variables covering (both during training and prediction respectively)? And when you say "the model performs better", what does that refer to exactly? – B.Liu May 14 '21 at 08:47
  • I do not end up with a perfect model. The models have an accuracy of 0.92 from the confusion matrix with the variable and 0.82 without. So when I use random forest and XGBoost, I should not be concerned about this, because they will penalize the correlation? We have historical transaction data going back 5 years. – Søren Therkildsen May 14 '21 at 08:51
  • Sorry, should have been clearer with my question: say you train the model today (or at time $t$), what transaction data are you including when you build your explanatory and target variables during training and prediction respectively? I know some use $t-5$ years to $t-1$ year for the explanatory variables and $t-1$ year to $t$ for the target variable during training, and then shift the two time periods forward a year to do live predictions. The reason I ask is that what you propose may or may not make sense depending on how you split the data temporally. – B.Liu May 14 '21 at 09:06
  • Further clarification question: when you say an accuracy of 0.92, can I assume that is on some validation set? And are your churn/non-churn classes balanced or unbalanced? Accuracy can be a pretty bad metric for unbalanced datasets. – B.Liu May 14 '21 at 09:08
  • A ['Buy Till You Die' model](https://en.m.wikipedia.org/wiki/Buy_Till_you_Die) might be worth considering. – Scortchi - Reinstate Monica May 14 '21 at 09:19
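
The temporal split described in B.Liu's comment above could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the function name `split_features_target`, the two example features, the window lengths expressed in days and the sample purchase dates are all hypothetical. The target here is "no purchase in the year after the cutoff", which is one possible reading of the churn definition, not necessarily the one used in the question.

```python
from datetime import date, timedelta


def split_features_target(purchase_dates, t, target_days=365, history_days=5 * 365):
    """Build features from [t - history, t - target] and the target from (t - target, t]."""
    cutoff = t - timedelta(days=target_days)
    start = t - timedelta(days=history_days)

    # Purchases visible to the model: observation window only.
    observed = [d for d in purchase_dates if start <= d <= cutoff]
    if not observed:
        return None  # no history before the cutoff, so nothing to build features from

    # Purchases in the target window are used only to define the label.
    future = [d for d in purchase_dates if cutoff < d <= t]

    features = {
        "recency_at_cutoff": (cutoff - max(observed)).days,
        "n_purchases": len(observed),
    }
    churned = 0 if future else 1
    return features, churned


# Hypothetical customers, for illustration only.
t = date(2021, 5, 14)
result_a = split_features_target([date(2019, 3, 1), date(2019, 11, 20)], t)
result_b = split_features_target([date(2020, 3, 5), date(2021, 4, 30)], t)
result_c = split_features_target([date(2021, 1, 10)], t)
```

With this layout the recency feature is measured at the cutoff, before the target window starts, so it no longer determines the label by construction. For live predictions, both windows would be shifted forward by a year, as the comment suggests.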
