Assume you’re dealing with an imbalanced dataset. I know I can do things like upsampling, downsampling, and synthetic sampling to build my train and test splits. My question is: if I’m using a random forest classifier, are there any implementations in R or Python that force each of the randomly generated trees to be trained on a sample with balanced classes?
3 Answers
Apparently what this is describing is called a "Balanced Random Forest" (BRF).
There is a separate Stack Exchange question that covers a corresponding R package: Implementing Balanced Random Forest (BRF) in R using RandomForests

Afflatus
sklearn.ensemble.RandomForestClassifier accepts a class_weight argument that controls how the samples are weighted, either globally or per tree. In particular, the documentation states: "The 'balanced_subsample' mode is the same as 'balanced' except that weights are computed based on the bootstrap sample for every tree grown." This seems to be exactly what you're asking about.
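A minimal sketch of that option on made-up toy data (the shapes and class ratio below are only for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)  # made-up 9:1 class imbalance

# With class_weight="balanced_subsample", each tree recomputes class
# weights from its own bootstrap sample rather than from the full data.
clf = RandomForestClassifier(n_estimators=50,
                             class_weight="balanced_subsample",
                             random_state=0)
clf.fit(X, y)
```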

Sycorax
Maybe you can try giving each tree a balanced subsample, for example by monkey-patching scikit-learn's bootstrap sampling:
import numpy as np
from sklearn.utils import check_random_state
from sklearn.ensemble import _forest  # private module ("forest" in sklearn < 0.22)

def set_rf_balanced_subsampling(y_tt_labels):
    """Monkey-patch scikit-learn's random forests so that each tree is grown
    on a balanced bootstrap: the same number of rows from every class,
    sampled with replacement. Relies on the private helper
    _generate_sample_indices, whose location and signature vary across
    scikit-learn versions.
    """
    # Use the size of the rarest class as the per-class sample count.
    each_tree_class_samples = y_tt_labels.value_counts().min()
    indices = {}
    for tt_label in ["H", "S", "BC"]:  # the class labels in this dataset
        indices[tt_label] = y_tt_labels[y_tt_labels == tt_label].index.values

    def balanced_sampling(random_state, n_samples, n_samples_bootstrap=None):
        rs = check_random_state(random_state)
        return np.concatenate(
            [rs.choice(indices[label], each_tree_class_samples, replace=True)
             for label in ["BC", "H", "S"]])

    _forest._generate_sample_indices = balanced_sampling
Also, I recommend checking out the imbalanced-learn (imblearn) library and combining its Pipeline with RandomOverSampler.