
I'm building a machine learning model to predict a process failure (1=fail, 0=no-fail). To begin with, I have a class imbalance ratio of 1:51. After some wrangling, I applied Clustering-Based Oversampling to fix the imbalance (resulting ratio 1:1.001). Then I tried a logistic regression model, and these were the results:

                  precision    recall  f1-score   support

               0       0.98      0.93      0.95     10638
               1       0.04      0.15      0.06       208

        accuracy                           0.91     10846
       macro avg       0.51      0.54      0.51     10846
    weighted avg       0.96      0.91      0.94     10846
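For reference, the class-1 numbers above can be reproduced from the definitions alone: F1 is the harmonic mean of precision and recall, so a low F1 for class 1 just means at least one of the two is poor. A minimal sketch using only the precision, recall, and support reported above (counts rounded):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values taken from the class-1 row of the report above.
support_1 = 208
recall_1 = 0.15
precision_1 = 0.04

tp = recall_1 * support_1         # true positives: recall * support, ≈ 31
predicted_pos = tp / precision_1  # total predicted positives: TP / precision, ≈ 780

print(round(tp), round(predicted_pos), round(f1(precision_1, recall_1), 2))
# → 31 780 0.06
```

So the model flags roughly 780 of 10846 cases as failures but only about 31 of those are real, which is exactly what the 0.04 precision says.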

I want to know what a high F1 score for class 0 and a low F1 score for class 1 mean before I go any further experimenting with different algorithms.

Info about the dataset:

  • 22 predictive features:
      • 1 numerical continuous, log-transformed with np.log
      • 2 numerical binary (0, 1)
      • 2 numerical ordinal, ranges 1-4 and 0-2 respectively
      • 17 numerical binary, one-hot encoded from a single 17-class categorical feature (I didn't drop one level to keep n-1 dummies)
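In case it helps, this is roughly what the preprocessing looks like; the column names and values below are made up for illustration, only the transformations match the description above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the feature types described above.
df = pd.DataFrame({
    "duration": [1.0, 10.0, 100.0],   # continuous -> log-transformed
    "flag":     [0, 1, 0],            # binary, kept as-is
    "severity": [1, 4, 2],            # ordinal 1-4, kept as-is
    "category": ["a", "b", "c"],      # 17-class categorical in the real data
})

df["duration"] = np.log(df["duration"])   # np.log transform (note: not a normalization)
dummies = pd.get_dummies(df["category"])  # full one-hot, no level dropped, as in my data
X = pd.concat([df.drop(columns="category"), dummies], axis=1)
print(X.shape)  # → (3, 6): 3 kept columns + 3 dummies
```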

Info about multicollinearity:

  • I have 6 variables with VIF ranging from 2 to 6.
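This is how I understand the VIF numbers: VIF_j = 1/(1 - R²_j), where R²_j comes from regressing feature j on the remaining features, so values of 2-6 indicate moderate (not severe) multicollinearity. A self-contained sketch on synthetic data (not my real features) showing the computation:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns by least squares."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.5, size=200)  # correlated with x1 -> inflated VIF
x3 = rng.normal(size=200)                  # independent -> VIF near 1
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 1) for j in range(3)])
```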

Info about the model:

LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train.values.ravel())
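One thing I could also try, since the comments suggest balancing mostly just shifts the decision threshold: sklearn's class_weight='balanced' reweights the loss instead of duplicating samples, so no oversampling step is needed. A sketch on synthetic 1:50-imbalanced data (not my real features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data standing in for the real features.
rng = np.random.default_rng(0)
n_neg, n_pos = 5000, 100
X = np.vstack([rng.normal(0, 1, (n_neg, 3)),   # majority class around 0
               rng.normal(1, 1, (n_pos, 3))])  # minority class around 1
y = np.r_[np.zeros(n_neg), np.ones(n_pos)]

# Same hyperparameters as above, plus class weighting on the original data.
LR = LogisticRegression(C=0.01, solver='liblinear',
                        class_weight='balanced').fit(X, y)

recall_1 = LR.predict(X[y == 1]).mean()  # fraction of true failures caught
print(recall_1)
```

The decision threshold can also be moved directly via LR.predict_proba instead of the default 0.5 cutoff.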
  • [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Jun 22 '21 at 20:40
  • The main surprise to me is that even after balancing (primarily in effect shifting the decision threshold), the model is only predicting ≈776 positive cases. A model built on 50%-positive data should be predicting roughly 50% positive cases, unless your test set is significantly unlike the (unbalanced) training set. – Ben Reiniger Jun 23 '21 at 17:22

0 Answers