I'm building a machine learning model to predict a process failure (1=fail, 0=no-fail). The original data has a class imbalance ratio of 1:51. After some wrangling, I applied Clustering-Based Oversampling to fix the imbalance (resulting ratio 1:1.001). I then fit a logistic regression model, and these were the results:
                  precision    recall  f1-score   support
               0       0.98      0.93      0.95     10638
               1       0.04      0.15      0.06       208
        accuracy                           0.91     10846
       macro avg       0.51      0.54      0.51     10846
    weighted avg       0.96      0.91      0.94     10846
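For reference, the per-class F1 values in the report are the harmonic mean of that class's precision and recall. The sketch below recomputes them from the rounded numbers above and also estimates the counts behind class 1. The helper name `f1` and the count variables are only illustrative, and because the inputs are rounded to two decimals the counts are approximate.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the report above.
f1_class0 = f1(0.98, 0.93)
f1_class1 = f1(0.04, 0.15)
print(round(f1_class0, 2), round(f1_class1, 2))  # 0.95 0.06

# Approximate counts implied by the rounded report (estimates only).
tp = round(0.15 * 208)            # failures correctly flagged, about 31
fn = 208 - tp                     # failures missed, about 177
fp = round((1 - 0.93) * 10638)    # non-failures flagged as failures, about 745
implied_precision = tp / (tp + fp)
print(tp, fn, fp, round(implied_precision, 2))  # 31 177 745 0.04
```

Read this way, the low F1 for class 1 combines two things: most real failures are missed (recall 0.15), and most predicted failures are false alarms (precision 0.04).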
I want to know what a high F1 score for class 0 and a low F1 score for class 1 mean before I go any further experimenting with different algorithms.
Info about the dataset:
- 22 predictive features:
  - 1 continuous numerical feature, independently normalized with np.log
  - 2 binary numerical features (0, 1)
  - 2 ordinal numerical features, with ranges 1-4 and 0-2 respectively
  - 17 binary numerical features, one-hot encoded from a categorical variable with 17 classes (I kept all 17 columns, i.e. I didn't use n-1 encoding)
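To make the n-1 remark concrete: when all 17 one-hot columns are kept, they sum to 1 in every row, so together with an intercept (scikit-learn's `LogisticRegression` fits one by default) the design matrix is rank deficient. The sketch below uses synthetic labels, not my data, and every variable name is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 17, 200

# Synthetic class labels; the first k entries guarantee every class appears.
labels = np.concatenate([np.arange(k), rng.integers(0, k, size=n - k)])

dummies = np.eye(k)[labels]          # (n, 17) one-hot matrix, all columns kept
row_sums = dummies.sum(axis=1)       # every row sums to exactly 1

intercept = np.ones((n, 1))
full = np.hstack([intercept, dummies])            # 18 columns
reduced = np.hstack([intercept, dummies[:, 1:]])  # 17 columns (first dummy dropped)

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # 18 columns, rank 17
print(reduced.shape[1], rank_reduced)  # 17 columns, rank 17
```

Dropping one dummy column removes the exact linear dependence; with the regularization set in my model (C=0.01) the fit still runs either way.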
Info about multicollinearity:
- I have 6 variables with VIF ranging from 2 to 6.
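For readers unfamiliar with VIF: the VIF of feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the other features. Below is a minimal NumPy sketch of that definition on synthetic data. The function `vif` and the toy columns are assumptions for illustration and not necessarily the exact procedure I used.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2_j)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Toy data: column c is mostly a copy of column a, column b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.5 * rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
print(vifs)  # a and c are inflated, b stays close to 1
```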
Info about the model:
    from sklearn.linear_model import LogisticRegression
    LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train.values.ravel())