
I have a dataset that contains 284315 samples of class 0 and 492 samples of class 1. I know, the imbalance is huge. I heard about oversampling methods, so I did the following using the RandomOverSampler class from the imbalanced-learn library:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)  # fit_sample in older imbalanced-learn versions

I trained a Random Forest classifier on this resampled data, and the confusion matrix looks like this:

array([[92005,  1833],
       [    8,   141]], dtype=int64)

So yeah, it got a 10-fold CV accuracy of 0.9945, but the model is obviously classifying everything it can as class 0. I know this is a difficult problem because of the class ratio, but is there anything I could do to get better performance?

Thanks!

Norhther

1 Answer


Do not use accuracy to evaluate a classifier: Why is accuracy not the best measure for assessing classification models?
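If it helps to see alternatives in code, here is a minimal sketch, assuming the question's X_test / y_test split and some already-fitted classifier clf (a placeholder name, not from the question). Per-class precision and recall, PR-AUC, and the Brier score (a proper scoring rule) all say far more about a rare-event classifier than raw accuracy:

from sklearn.metrics import classification_report, average_precision_score, brier_score_loss

proba = clf.predict_proba(X_test)[:, 1]   # predicted probability of class 1
print(classification_report(y_test, clf.predict(X_test)))   # per-class precision/recall
print("PR-AUC:", average_precision_score(y_test, proba))    # threshold-independent
print("Brier score:", brier_score_loss(y_test, proba))      # proper scoring rule, lower is better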

Unbalanced classes are almost certainly not a problem, and oversampling will not solve a non-problem: Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
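To make that concrete, a hedged sketch of the alternative: fit on the original, unbalanced training data, keep the probabilistic predictions, and apply a threshold only at decision time. The misclassification costs below are made up purely for illustration; plug in your own:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)  # no resampling
proba = clf.predict_proba(X_test)[:, 1]

# Hypothetical costs: a missed class-1 case is 10x worse than a false alarm.
cost_fn, cost_fp = 10.0, 1.0
threshold = cost_fp / (cost_fp + cost_fn)     # ~0.09 instead of the default 0.5
y_decision = (proba >= threshold).astype(int)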

Stephan Kolassa
Thanks for your answer. After reading a bit, I changed the oversampling approach: now I penalize the bigger class with class_weight="balanced". Tuning the hyperparameters also reduced the false negatives to 3! – Norhther Jul 18 '18 at 11:01
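For later readers, a minimal sketch of what this comment describes, assuming scikit-learn's RandomForestClassifier; the parameter grid is illustrative, not the commenter's actual settings:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# class_weight="balanced" reweights classes inversely to their frequency,
# so the rare class 1 is not drowned out - no duplicated rows needed.
rf = RandomForestClassifier(class_weight="balanced", random_state=0)
grid = GridSearchCV(
    rf,
    {"n_estimators": [100, 300], "max_depth": [None, 10, 20]},  # illustrative grid
    scoring="average_precision",  # PR-AUC is more informative than accuracy here
    cv=10,
)
grid.fit(X_train, y_train)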