
I'm building a Random Forest model on an unbalanced 4-class dataset. So far I understand how to use oversampling and train my model. My doubt is about when to perform the oversampling.

I've already seen a lot of questions about oversampling before or after the train/test split, and I already know that the recommended approach is to split into train/test first and then apply oversampling only to the training data.

My doubt concerns this second scenario (oversampling after splitting).

Suppose that I have already split my dataset into train and test with an 80%-20% ratio, so I have my X_train, y_train, X_test, y_test data.
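
For context, the split itself could look something like this (X and y are the full feature matrix and labels, and SEED is a fixed integer, none of which are shown here; stratify=y is optional but keeps the class proportions equal in train and test):

from sklearn.model_selection import train_test_split

# 80%-20% stratified split; the test set is never touched by the oversampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=SEED)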

Now I'm going to perform cross-validation over my X_train in order to estimate my validation error. In Python I could have something like:

from sklearn.model_selection import cross_val_score
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

# Inside the pipeline, SMOTE is fit only on the training folds of each CV split
imba_pipeline = make_pipeline(
    SMOTE(sampling_strategy='auto', k_neighbors=10, random_state=SEED),
    RandomForestClassifier(n_estimators=200, bootstrap=False, min_samples_leaf=2,
                           min_samples_split=2, max_depth=14, random_state=SEED,
                           class_weight='balanced', max_features='sqrt'))

scores = cross_val_score(imba_pipeline, X_train, y_train, scoring='accuracy', cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Now suppose I'm happy with my cross-validation score and I want to train my final model.

Should I retrain it on the oversampled X_train? So basically I would do something like:

sm = SMOTE(sampling_strategy='auto', k_neighbors=10, random_state=SEED)
X_train_upsample, y_train_upsample = sm.fit_resample(X_train, y_train)  # formerly fit_sample
clf = RandomForestClassifier(n_estimators=200, bootstrap=False, min_samples_leaf=2,
                             min_samples_split=2, max_depth=14, random_state=SEED,
                             class_weight='balanced', max_features='sqrt'
                             ).fit(X_train_upsample, y_train_upsample)

Or is it a bad idea?
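
Whichever variant is used, the held-out X_test should stay exactly as it came out of the split; only the training data get resampled. As a minimal sketch, the final check could look like this (refitting the imba_pipeline from above is equivalent to the manual SMOTE-then-fit, since the imblearn pipeline applies SMOTE only during fit and skips it at predict time):

from sklearn.metrics import classification_report

# Refit on the full training split; X_test is never resampled
final_model = imba_pipeline.fit(X_train, y_train)
y_pred = final_model.predict(X_test)
print(classification_report(y_test, y_pred))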

What if I performed cross-validation on the already oversampled training set? So, instead of oversampling within each single fold, I would have something like:

sm = SMOTE(sampling_strategy='auto', k_neighbors=10, random_state=SEED)
X_train_upsample, y_train_upsample = sm.fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, bootstrap=False, min_samples_leaf=2,
                             min_samples_split=2, max_depth=14, random_state=SEED,
                             class_weight='balanced', max_features='sqrt')

scores = cross_val_score(clf, X_train_upsample, y_train_upsample, scoring='accuracy', cv=10)

1 Answer

> I already know that the best way is to split into train/test before and then apply oversampling.

No, it isn't. See [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) Answer: they aren't, and it doesn't.

(See here for a motivation for short answers. Longer answers are always welcome. See here for a general motivation for an answer that essentially says that "the question is wrong".)

Stephan Kolassa
  • The post seems more than reasonable. But why is there such a huge amount of posts and papers talking about class imbalance? I could point out plenty of resources that talk about class imbalance and solve it with, for example, oversampling/undersampling and weighting methods. Who should I trust? For example, class imbalance in image classification problems is definitely a problem, am I wrong? – Mattia Surricchio Sep 19 '20 at 15:30
  • In my specific case, using oversampling to add artificial samples to the minority class (which has fewer than half the samples of the other classes) improves my classification performance. Obviously I'm testing my model on unseen, non-upsampled data; the upsampling is done only on the training set. Is it a misleading result? – Mattia Surricchio Sep 19 '20 at 15:35
  • Regarding the multitude of materials on unbalanced classes, [see Matthew Drury (and the 9 people who upvoted his comment)](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he#comment672218_357466). – Stephan Kolassa Sep 19 '20 at 15:43
  • Regarding "improves my classification performances": on what KPI? Are you using accuracy? [You shouldn't.](https://stats.stackexchange.com/q/312780/1352) Are you using a proper scoring rule on probabilistic classifications? If so, I would be quite surprised if oversampling helped. Note the discussion [below my question](https://stats.stackexchange.com/q/357466/1352). – Stephan Kolassa Sep 19 '20 at 15:43
  • To be honest, I believe that a lot of the noise about oversampling comes from people who, sorry, don't know statistics and don't know how to think statistically, which in turn is mainly driven by the fact that much of "ML" has been appropriated by computer scientists, not the statistical community. – Stephan Kolassa Sep 19 '20 at 15:46
  • On Cross Validated, you are likely to get more statistical answers. I submit that our answers are better than oversampling. If you disagree, I would be very interested in your arguments, because I have never seen a serious one beyond "accuracy does weird things with unbalanced data", which statisticians have a (IMO) better answer to than oversampling. – Stephan Kolassa Sep 19 '20 at 15:46
  • I'm using the F1 score and then checking "manually" over a confusion matrix with true and predicted labels. Without the oversampling, the minority class has the worst performance, as I intuitively expected, while I get pretty good values for the classes with a very high number of samples in my dataset. – Mattia Surricchio Sep 19 '20 at 15:48
  • I'm actually a computer science student, not an expert for sure! I'm eager to learn, but as I mentioned already, I found many different opinions on the subject and couldn't figure out which one was actually better. – Mattia Surricchio Sep 19 '20 at 15:50
  • @MattiaSurricchio [F1 ignores true negatives and thus depends on which class you choose as positive](https://stats.stackexchange.com/a/192765/28500). All of F1, accuracy, precision, etc. depend on an assumption, often hidden and set to 0.5, about the probability cutoff from a model used to make a final class assignment. That probability cutoff should depend on the [relative costs of false positives and false negatives](https://stats.stackexchange.com/a/441734/28500). Much of what you read by computer-science/machine-learning people overlooks those issues. – EdM Sep 19 '20 at 16:02
  • @EdM I would like to point out that I'm using f1_macro from the scikit-learn framework; I don't know if that is useful. – Mattia Surricchio Sep 19 '20 at 16:41
  • @MattiaSurricchio [here's an example](https://stats.stackexchange.com/q/487471/28500) of a good use for some type of F-score (there's a whole family, based on tradeoffs between precision and recall). In that case one is looking for objects, and there are no true-negative objects, just an object-free background. Some text-classification schemes are similar: you primarily want to identify some texts on a particular topic out of a large corpus and don't care much about off-topic texts in the corpus so long as you find enough on the topic. So "usefulness" is very application-dependent. – EdM Sep 19 '20 at 16:53
  • In my specific case I'm classifying driving patterns such as brake, turn, accelerate, and normal (constant speed on a straight road). I don't really have any priority over classes (unlike the typical example of cancer classification with unbalanced data), but the data labeled as brake are much scarcer than the other categories. What could be a good metric? I thought of F1 since it includes both precision and recall. – Mattia Surricchio Sep 19 '20 at 17:11
  • Use probabilistic class membership predictions. Evaluate these using proper scoring rules, like the Brier or the log score. These will provably (!) draw you towards correct probabilistic classifications (that's what "proper" means for scoring rules). [You can find a few links here.](https://stats.stackexchange.com/a/487448/1352) – Stephan Kolassa Sep 20 '20 at 05:16
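
Following up on the last comment in the thread above, here is a minimal sketch of what scoring probabilistic class-membership predictions with proper scoring rules could look like for this 4-class problem (final_model stands for any fitted classifier with predict_proba, e.g. the pipeline fitted in the sketch earlier; the multiclass Brier score below is computed as the mean squared difference from the one-hot encoded true labels, which is one common convention):

import numpy as np
from sklearn.metrics import log_loss
from sklearn.preprocessing import label_binarize

# Probabilistic predictions instead of hard class labels
proba = final_model.predict_proba(X_test)            # shape: (n_samples, 4)

# Log score (strictly proper; smaller is better)
print("log loss:", log_loss(y_test, proba, labels=final_model.classes_))

# Multiclass Brier score (strictly proper; smaller is better)
y_onehot = label_binarize(y_test, classes=final_model.classes_)
print("Brier score:", np.mean(np.sum((proba - y_onehot) ** 2, axis=1)))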