I'm trying to predict customer churn in a non-contractual setting, which means we cannot observe exactly when a customer churns.
Therefore, we created our Y variable (churn) with the rule: if days_since_last_purchase > 365 days, then 1 (churned), otherwise 0.
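To make the labelling rule concrete, here is a minimal sketch of how I understand it. The function name `churn_label`, the snapshot date, and the two example customers are made up for illustration; only the 365-day threshold comes from our actual rule.

```python
from datetime import date

CHURN_THRESHOLD_DAYS = 365  # threshold from the rule described above


def churn_label(last_purchase: date, snapshot: date) -> int:
    """Return 1 (churned) if more than 365 days separate the last
    purchase from the snapshot date, otherwise 0."""
    days_since_last_purchase = (snapshot - last_purchase).days
    return 1 if days_since_last_purchase > CHURN_THRESHOLD_DAYS else 0


# Hypothetical customers and snapshot date, for illustration only.
snapshot = date(2024, 1, 1)
last_purchases = {
    "A": date(2022, 6, 15),   # last purchase well over a year before the snapshot
    "B": date(2023, 11, 20),  # last purchase a few weeks before the snapshot
}

labels = {cid: churn_label(d, snapshot) for cid, d in last_purchases.items()}
print(labels)
```

Note that a customer whose last purchase was exactly 365 days before the snapshot is labelled 0, because the rule uses a strict "greater than".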
My question is therefore: can I still use the variable days_since_last_purchase as an explanatory (X) variable, even though it is used to create the Y variable? When I include days_since_last_purchase, the model performs better and the variable ranks first in variable importance.
The models we are using are Logistic Regression, Random Forest, and XGBoost.
Thanks for your help
I do not end up with a perfect model. The models have an accuracy of 0.92 (computed from the confusion matrix) with the variable and 0.82 without it. So when I use Random Forest and XGBoost, I should not be concerned about this, because they will penalize the correlation? We have historical transaction data going back 5 years.
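For reference, this is roughly how the accuracy is read off the confusion matrix. The function name and the counts below are hypothetical placeholders, not our actual confusion matrix; they are only chosen to show the arithmetic.

```python
def accuracy_from_confusion(tn: int, fp: int, fn: int, tp: int) -> float:
    """Accuracy = correctly classified cases / all cases."""
    total = tn + fp + fn + tp
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts, for illustration only (not our real results).
acc = accuracy_from_confusion(tn=50, fp=5, fn=3, tp=42)
print(round(acc, 2))
```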