
I have the following problem and have been desperately looking for a solution for several weeks, so I'm hoping to find some help here. In this post, I'm always referring to Sklearn.

My goal: I want to classify an imbalanced dataset with 4 classes. The ratio of datapoints is approximately 1:1:1:5. I have a total of 5165 datapoints and 632 features. The final metric I want to optimize is the balanced accuracy score (= average recall rate over all classes).
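To make that metric concrete, here is a tiny sketch (the labels are made up purely for illustration) showing that balanced accuracy is just the macro-averaged recall:

from sklearn.metrics import balanced_accuracy_score, recall_score

# Toy labels, purely for illustration: class 0 is the majority class
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# Recall per class: 2/4 = 0.5 for class 0, 2/2 = 1.0 for class 1
print(balanced_accuracy_score(y_true, y_pred))        # 0.75
print(recall_score(y_true, y_pred, average='macro'))  # 0.75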

My current status: I tried using the calculation pipeline below to solve the problem. With an SVC using an RBF kernel and a grid search for hyperparameter optimization, I reached a sufficient result of approximately 0.85 on my dataset.

My problem: If I use the RandomForestClassifier instead, the result is very bad (a balanced accuracy, bac, around 0.45 or even lower, with the same pipeline as before, just a different classifier). During training I reach a bac of 1 (perfect), but on the test set it is as bad as just mentioned. Considering that random guessing would give a bac of 0.25, the result looks even worse. So my problem is probably overfitting. But how do I solve it? I already checked several resources like:

https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f
https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d

The result didn't get any better. I tried the following (a sketch of such a search follows below):

- different numbers of n_estimators: 100, 1000, 5000, ...
- undersampling with KMeans and oversampling with SMOTE from imblearn
- criterion 'gini' vs. 'entropy'
- different values for max_features
- oob_score True / False
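For instance, assuming the training split X_1, y_1 from the pipeline further below, a grid search scored by balanced accuracy could look like this (the grid values are placeholders, not tuned recommendations):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder search space; the values are starting points, not recommendations
param_grid = {
    'n_estimators': [100, 500, 1000],
    'max_features': ['sqrt', 'log2', 0.1],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 5, 20],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced'),
    param_grid,
    scoring='balanced_accuracy',  # built-in scorer name in Sklearn
    cv=5,
    n_jobs=-1,
)
search.fit(X_1, y_1)  # X_1, y_1: the training split from the pipeline below
print(search.best_params_, search.best_score_)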

I have to admit that I don't really understand what properties like max_leaf_nodes mean, so I just used the default values there. I'm not looking for the perfect result; all I'm looking for is a more or less reasonable result and a starting point to build on. To me, it seems that the RandomForestClassifier doesn't work here at all, as if there were a bug (which is most probably not the case).
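For what it's worth, max_leaf_nodes is one of several parameters that cap how large each individual tree can grow; fully grown trees can memorize the training set, which matches the train-bac-of-1 symptom. A sketch of deliberately constraining tree size (the specific values are illustrative guesses, not tuned settings):

from sklearn.ensemble import RandomForestClassifier

# All values below are illustrative guesses, not tuned settings
estimator = RandomForestClassifier(
    n_estimators=500,
    max_depth=10,          # cap the depth of each tree
    max_leaf_nodes=50,     # or cap the number of leaves per tree directly
    min_samples_leaf=5,    # each leaf must contain at least 5 samples
    class_weight='balanced',
)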

Do you have any suggestion what I could try to do, what parameters are the ones I have to focus on?

Below I inserted my calculation steps using a dataset from Sklearn. Please note that this is not the dataset I want to make predictions on; on this dataset, the algorithm seems to work OK.

I'm really grateful for every helpful reply.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# Load the demo dataset (not my real data) and split it 80/20
X, y = load_breast_cancer(return_X_y=True)
X_1, X_2, y_1, y_2 = train_test_split(X, y, test_size=0.2)

scaling = RobustScaler()
feature_selection = SelectKBest(score_func=f_classif, k=25)
estimator = RandomForestClassifier(class_weight='balanced')

# Fit the scaler and the feature selection on the training split only,
# then apply the same transforms to the test split
X_1 = scaling.fit_transform(X_1)
X_2 = scaling.transform(X_2)
X_1 = feature_selection.fit_transform(X_1, y_1)
X_2 = feature_selection.transform(X_2)

estimator = estimator.fit(X_1, y_1)
prediction = estimator.predict(X_2)

print(balanced_accuracy_score(y_2, prediction))
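For comparison, the training score (which is where I see the bac of 1 on my own data) can be printed the same way:

# Training-set score, for comparison with the test-set score above
print(balanced_accuracy_score(y_1, estimator.predict(X_1)))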
jordin1987
  • Perhaps this is a situation where random forest doesn't do well (https://stats.stackexchange.com/questions/112148/when-to-avoid-random-forest), or your features are not very predictive (https://stats.stackexchange.com/questions/222179/how-to-know-that-your-machine-learning-problem-is-hopeless). – Sycorax Nov 12 '19 at 13:35
  • Since the SVC works fine, I would argue that your assumption of non-predictive features doesn't hold. Regarding your first point: what are, generally, situations where random forest doesn't do well? – jordin1987 Nov 12 '19 at 13:48
  • They're outlined in the first link. Here's another example: https://stats.stackexchange.com/questions/435329/how-to-explain-random-forest-ml-algorithm-doesnt-learn-at-all-while-logistic-r#comment811572_435329 – Sycorax Nov 12 '19 at 13:51

1 Answer


Perhaps SelectKBest with f_classif is removing features that the RandomForest could leverage. Try selecting the k best features according to the forest's own feature_importances_, then rerun the RandomForest with that feature set.
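A minimal sketch of that idea, reusing the names from the question's pipeline and assuming X_1/X_2 hold the scaled features before the SelectKBest step (k=25 is an arbitrary choice mirroring the question):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Rank features with a forest fitted on all (scaled) training features
rf = RandomForestClassifier(class_weight='balanced')
rf.fit(X_1, y_1)

# Keep the 25 highest-ranked features (k=25 mirrors the SelectKBest setting)
top_k = np.argsort(rf.feature_importances_)[::-1][:25]
X_1_sel = X_1[:, top_k]
X_2_sel = X_2[:, top_k]

# Refit on the reduced feature set and evaluate as before
rf_sel = RandomForestClassifier(class_weight='balanced').fit(X_1_sel, y_1)
prediction = rf_sel.predict(X_2_sel)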