I'm doing a classification exercise to predict churn with a random forest.
My current metrics are:
Yes - Precision: 0.68, Recall: 0.61
No - Precision: 0.86, Recall: 0.90
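These per-class numbers are the kind of output you get from sklearn's classification_report on the test split. A minimal sketch, assuming the clf, X_test and y_test from the code further down:

from sklearn.metrics import classification_report

# per-class precision/recall on the held-out test split
print(classification_report(y_test, clf.predict(X_test)))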
However, whenever I try to predict on new data (where "No" should correspond to about 70% of the dataset), the model appears to make near-random guesses and heavily over-predicts the positive class (Churn = Yes = 1).
The new data is a sample I took from the original dataset that the model never saw during training. Here is the distribution of the true target values versus the predicted ones:
True
No: 977
Yes: 106
Predicted
No: 651
Yes: 432
Threshold = 0.5
Precision: 0.12
Recall: 0.75
True Positives: 80
False Positives: 571
True Negatives: 406
False Negatives: 26
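For reference, the numbers above can be reproduced along these lines. This is only a sketch: y_true_new and y_pred_new are placeholder names for the true and predicted labels of the held-out sample, and I'm assuming the labels are the strings "No"/"Yes":

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# "Yes" (churn) is the positive class; the labels= argument fixes the row/column order
tn, fp, fn, tp = confusion_matrix(y_true_new, y_pred_new, labels=['No', 'Yes']).ravel()
precision = precision_score(y_true_new, y_pred_new, pos_label='Yes')
recall = recall_score(y_true_new, y_pred_new, pos_label='Yes')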
I thought it might be an overfitting problem, so I fitted a gradient boosting classifier with default parameters, but it showed the same behavior as above.
I trained the model in two ways:
1 - Since the data was unbalanced, I bootstrapped (oversampled) the "Yes" class; a sketch of what I mean is just after this list.
2 - I kept the original ratio of 70% "No" and 30% "Yes".
Both showed the same behavior.
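By "bootstrapped the Yes class" I mean upsampling the minority class with replacement until it matches the majority class, roughly like the sketch below (using sklearn.utils.resample; simplified, with placeholder names, not my exact code):

from sklearn.utils import resample
import pandas as pd

# Recombine features and target of the training split (X_train_df is a placeholder DataFrame)
train = pd.concat([X_train_df, y_train], axis=1)
majority = train[train['churn'] == 'No']
minority = train[train['churn'] == 'Yes']

# Sample the "Yes" rows with replacement until they match the "No" count
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=45)
train_balanced = pd.concat([majority, minority_upsampled])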
Does anyone have an idea of what might be happening?
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

variables = ['X', 'Y', 'Z']  # feature column names (placeholders)
X = df[variables]
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=45)

# Scale features: fit the scaler on the training split only, then apply it to the test split
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
clf = RandomForestClassifier(random_state=2, n_estimators=200, min_samples_split=5,
                             min_samples_leaf=1, max_features='sqrt', max_depth=60,
                             bootstrap=True, oob_score=True)
clf.fit(X_train, y_train)
clf.score(X_test, y_test).round(4)  # mean accuracy on the test split
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
Predicted      0    1   All
True
0           1045  122  1167
1            166  262   428
All         1211  384  1595
y_pred_new = clf.predict(new_data[variables].drop('id', axis = 1))
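The true vs. predicted counts shown earlier were compared roughly like this (a sketch, assuming new_data holds the true labels in a 'churn' column):

import pandas as pd

# Distribution of true labels vs. model predictions on the held-out sample
print(new_data['churn'].value_counts())      # true:      No 977, Yes 106
print(pd.Series(y_pred_new).value_counts())  # predicted: No 651, Yes 432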