
I'm working on a classification exercise to predict churn using a random forest.

My current metrics are:

Yes - Precision: 0.68, Recall: 0.61

No - Precision: 0.86, Recall: 0.90
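These per-class numbers can be reproduced with scikit-learn's classification report, roughly like this (a minimal sketch; clf, X_test and y_test are the fitted model and hold-out split from the code further down, with 0 = "No" and 1 = "Yes"):

from sklearn.metrics import classification_report

# per-class precision/recall on the hold-out test split
print(classification_report(y_test, clf.predict(X_test), target_names=['No', 'Yes']))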

However, whenever I try to predict on new data (where "No" should correspond to about 70% of the dataset), the model seems to guess almost at random and heavily over-predicts the positive class (Churn = Yes = 1).

The new data is a sample I set aside from the original dataset, which the model never saw during training. Here is the distribution of the true target values versus the predicted ones:

True

No: 977

Yes: 106

Predicted

No: 651

Yes: 432

Threshold = 0.5
Precision: 0.12, Recall: 0.75
True Positives: 80, False Positives: 571, True Negatives: 406, False Negatives: 26
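For context, the evaluation on this new sample looks roughly like the sketch below; X_new and y_true_new are placeholders for the sample's features and true labels (with "Yes" encoded as 1), and the positive class is "Yes":

from sklearn.metrics import precision_score, recall_score, confusion_matrix

# score the held-out sample at a 0.5 probability threshold
proba_yes = clf.predict_proba(X_new)[:, 1]      # predicted P(Churn = Yes)
y_pred_new = (proba_yes >= 0.5).astype(int)
print(precision_score(y_true_new, y_pred_new))  # 0.12
print(recall_score(y_true_new, y_pred_new))     # 0.75
tn, fp, fn, tp = confusion_matrix(y_true_new, y_pred_new).ravel()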

I thought it might be an overfitting problem, so I also fitted a gradient boosting classifier with default parameters, but it showed the same behavior as above.
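That check amounted to swapping in a gradient boosting model with default parameters on the same features and split, something like:

from sklearn.ensemble import GradientBoostingClassifier

# second model as an overfitting sanity check, default hyperparameters
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb.score(X_test, y_test)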

I trained the model in two ways:

1 - The classes were imbalanced, so I bootstrapped (oversampled with replacement) the "Yes" class, roughly as sketched below.

2 - I left the original distribution as it was ("No" 70%, "Yes" 30%).

Both had the same behavior.
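The bootstrapping in option 1 was plain oversampling of the minority class with replacement, along these lines (a sketch; the exact amount of upsampling, here up to the size of the majority class, is illustrative):

import pandas as pd
from sklearn.utils import resample

# option 1: upsample the "Yes" (1) class with replacement before splitting/training
majority = df[df['churn'] == 0]
minority = df[df['churn'] == 1]
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=45)
df_balanced = pd.concat([majority, minority_upsampled])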

Does anyone have an idea of what might be happening? Here is my code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

variables = ['X', 'Y', 'Z']  # feature column names (placeholders)

X = df[variables]

y = df['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=45)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # scaler fitted on the training split only
X_test = sc.transform(X_test)

clf = RandomForestClassifier(random_state=2, n_estimators=200, min_samples_split=5,
                             min_samples_leaf=1, max_features='sqrt', max_depth=60,
                             bootstrap=True, oob_score=True)
clf.fit(X_train, y_train)


print(round(clf.score(X_test, y_test), 4))  # accuracy on the hold-out test split

y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted     0    1   All
True
0          1045  122  1167
1           166  262   428
All        1211  384  1595
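Reading the "Yes" (1) column and row off this table: precision = 262 / 384 ≈ 0.68 and recall = 262 / 428 ≈ 0.61, which matches the per-class metrics I quoted at the top, so on the test split the model behaves as expected.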

y_pred_new = clf.predict(new_data[variables].drop('id', axis=1))
