
I have a dataset that isn't balanced (94% no, 6% yes). To balance the data, I'm trying to use SMOTE. However, when I do, the model predicts 30% to 50% "yes" on the test set, depending on the variables and settings I use. After building the model, which the confusion matrix shows is about 75% accurate on the test set (77% on the training set), I test it on a completely different CSV file of 100,000 customers. As mentioned above, I'm getting anywhere between 30,000 and 50,000 positives, when it should only be around 6,000.

After changing some settings in the model, I added cross validation thinking it would help; now the model predicts only 300 positives, which is 0.3%, well below what I'm expecting. Statistically speaking, the model should predict around 6% "yes," unless I'm an idiot and going about this all wrong. My best guess is that I'm doing the cross validation incorrectly. Any advice?

Thanks. Code is below.


```
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, RepeatedStratifiedKFold, StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix,roc_curve, roc_auc_score, precision_score, recall_score, precision_recall_curve
from sklearn.metrics import f1_score

X=df[[ 'loancount', 'sharebalance', 'CreditScore','YearsAccountOpen', 'OpenLoanCount','Age','AvgTransactionCount','DebtRatio']]
y=df['OpenedLCInd']

# standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

# split into training and testing datasets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 12, shuffle = True, stratify = y)
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  
    X_test = X[test_index]
    y_test = y[test_index]    
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = LogisticRegression(solver = 'lbfgs')
    model.fit(X_train_oversampled, y_train_oversampled )  
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')

##Predict on new CSV File

predictions = model.predict(X_test)


test_predictions = model.predict(test[[ 'loancount', 'sharebalance', 'CreditScore','YearsAccountOpen', 'OpenLoanCount','Age','AvgTransactionCount','DebtRatio']])
                                
test["predictions"] = test_predictions

test.to_csv(r'models\smote test6.csv')
```
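
In case it helps show where I might be going wrong, here is how I understand the cross validation is *supposed* to be set up, with the scaler and SMOTE wrapped in an imblearn Pipeline so that resampling only ever happens on each fold's training rows. This is just my own sketch based on the imblearn docs (same column names as above, and it assumes `X` and `y` are the raw, unscaled feature frame and target); I haven't confirmed it's the right approach:

```
# sketch only: scaler + SMOTE + model inside an imblearn Pipeline, so the
# validation rows of each fold are never resampled or used for fitting the scaler
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=12)),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
scores = cross_validate(pipe, X, y, cv=cv, scoring=['f1', 'roc_auc'])
print(scores['test_f1'].mean(), scores['test_roc_auc'].mean())

# refit on everything, then score the new file with the same fitted pipeline
pipe.fit(X, y)
cols = ['loancount', 'sharebalance', 'CreditScore', 'YearsAccountOpen',
        'OpenLoanCount', 'Age', 'AvgTransactionCount', 'DebtRatio']
test['predictions'] = pipe.predict(test[cols])
```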
  • Good news! Class imbalance is not a problem! https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en https://stats.stackexchange.com/q/283170/247274 – Dave Jun 18 '21 at 03:36
  • This question on class balancing has been closed, but it has a very nice list of relevant links https://stats.stackexchange.com/q/364361/4598 – cbeleites unhappy with SX Jun 18 '21 at 09:03
  • Thanks for the links. But even without using smote, after cross validation, I'm getting a less than 1% predicted "Yes" with my model. – user14316330 Jun 18 '21 at 14:46
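
Following up on the proper-scoring-rule links in the comments above, a minimal sketch (my own, reusing `model` and the last fold's `X_test` / `y_test` from the question's code) of looking at the predicted probabilities instead of the hard 0/1 labels; the share of rows flagged "yes" at the default 0.5 cutoff is not expected to match the 6% base rate:

```
from sklearn.metrics import brier_score_loss, log_loss

# probability of "yes" for each held-out row
proba = model.predict_proba(X_test)[:, 1]

# proper scoring rules grade the probabilities themselves
print('Brier score:', brier_score_loss(y_test, proba))
print('Log loss:   ', log_loss(y_test, proba))

# the fraction flagged "yes" is a function of the cutoff, not of the base rate
for cutoff in (0.5, 0.7, 0.9):
    print(cutoff, (proba >= cutoff).mean())
```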

0 Answers