My prediction task is as follows:
Use a person's name to predict their ethnicity (four categories: "English", "French", "Chinese", and "All others") as a multiclass classification problem. During feature engineering, the name variable is broken down into 3-letter and 4-letter substrings. For example, the name "Robert" is broken down into "rob", "obe", "ber", "ert", "robe", "ober", and "bert".
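For concreteness, here is a minimal sketch of that substring extraction (the function name `char_ngrams` is my own; the post does not show the actual feature-engineering code):

```python
def char_ngrams(name, sizes=(3, 4)):
    """Return all lowercase character n-grams of the given sizes."""
    s = name.lower()
    return [s[i:i + n] for n in sizes for i in range(len(s) - n + 1)]

char_ngrams("Robert")
# ['rob', 'obe', 'ber', 'ert', 'robe', 'ober', 'bert']
```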
I have about 5 million rows, each representing a single person. I split the data into 80:20 training:test sets. Within the training set, I carved out 10% as a development set, on which I run 5-fold cross-validation to find the optimal hyperparameter set for each ML algorithm. The algorithms I used are regularized logistic regression (LR), linear SVC, non-linear SVC, decision trees (DT), and random forest (RF).
I took the hyperparameter set that gives the best F-score in 5-fold cross-validation, used it to train the model on the entire training set, and then evaluated the trained model's performance on the test set.
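To make the splitting scheme concrete, a minimal sketch with toy stand-ins (the real `X`, `y` would be the engineered substring features and the 4-class labels; whether the real splits were stratified is not stated, so I leave stratification out):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real features and labels
X = [[i] for i in range(100)]
y = [i % 4 for i in range(100)]

# 80:20 train:test split, then carve 10% of the training set out as the dev set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_rest, X_dev, y_rest, y_dev = train_test_split(X_train, y_train, test_size=0.1, random_state=0)

print(len(X_train), len(X_test), len(X_dev))  # 80 20 8
```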
The strange phenomenon I am seeing is that LR and linear SVC reach accuracy and F1-scores in the high 80s, while DT and RF perform extremely poorly at about 50%, roughly the same as benchmark dummy predictors that have no predictive value.
I understand that some ML algorithms perform better than others in different problem spaces (the no-free-lunch theorem), and the fact that LR and linear SVC outperform non-linear SVC, which in turn vastly outperforms DT and RF, suggests the problem is linearly separable but the boundaries are not neatly parallel to the feature axes.
However, the fact that DT/RF were incapable of learning anything at all (performing like random predictors) is shocking to me, especially with this much data and with LR/linear SVC doing quite well. I also evaluated the trained models on the training set itself: DT/RF learnt very little or nothing even there, which I found strange, because I had initially suspected the poor test-set performance was due to overfitting on the training set.
Shouldn't DT/RF at least be able to learn something, given how flexible they are? Is this likely to happen in real life (i.e., am I seeing a real, natural phenomenon), or have I likely used DT/RF wrong? The code below shows the hyperparameter space I allow the random search to explore during 5-fold cross-validation.
# Partial code to specify hyperparameter space to be searched
'LR_V1': {'clf': LogisticRegression(),
          'param': {
              'logisticregression__solver': ['liblinear'],
              'logisticregression__penalty': ['l1', 'l2'],
              'logisticregression__C': np.logspace(-4, 4, 20),
              'logisticregression__tol': np.logspace(-5, 5, 20),
              'logisticregression__class_weight': [None, 'balanced'],
              'logisticregression__multi_class': ['ovr', 'auto'],
              'logisticregression__max_iter': [50, 1000, 4000, 20000],
          }},
'LR_V2': {'clf': LogisticRegression(),
          'param': {
              'logisticregression__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
              'logisticregression__penalty': ['none', 'l2'],
              'logisticregression__C': np.logspace(-4, 4, 20),
              'logisticregression__tol': np.logspace(-5, 5, 20),
              'logisticregression__class_weight': [None, 'balanced'],
              'logisticregression__multi_class': ['ovr', 'multinomial', 'auto'],
              'logisticregression__max_iter': [50, 1000, 4000, 20000],
          }},
'SVC_LINEAR': {'clf': OneVsRestClassifier(LinearSVC()),
               'param': {
                   'onevsrestclassifier__estimator__penalty': ['l2'],
                   'onevsrestclassifier__estimator__loss': ['hinge', 'squared_hinge'],
                   'onevsrestclassifier__estimator__C': np.logspace(-4, 4, 20),
                   'onevsrestclassifier__estimator__tol': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
                   'onevsrestclassifier__estimator__class_weight': [None, 'balanced'],
                   'onevsrestclassifier__estimator__multi_class': ['ovr', 'crammer_singer'],
                   'onevsrestclassifier__estimator__max_iter': [50, 1000, 4000, 20000],
               }},
'SVC_NONLINEAR': {'clf': OneVsRestClassifier(SVC()),
                  'param': {
                      'onevsrestclassifier__estimator__kernel': ['poly', 'rbf', 'sigmoid'],
                      'onevsrestclassifier__estimator__C': np.logspace(-4, 4, 20),
                      'onevsrestclassifier__estimator__tol': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
                      'onevsrestclassifier__estimator__class_weight': [None, 'balanced'],
                      'onevsrestclassifier__estimator__decision_function_shape': ['ovo', 'ovr'],
                      'onevsrestclassifier__estimator__max_iter': [50, 1000, 4000, 20000],
                  }},
'RF': {'clf': RandomForestClassifier(),
       'param': {
           'randomforestclassifier__n_estimators': [1, 8, 16, 32, 64, 100, 200, 500, 1000],
           'randomforestclassifier__criterion': ['gini', 'entropy'],
           'randomforestclassifier__class_weight': [None, 'balanced', 'balanced_subsample'],
           'randomforestclassifier__max_depth': [None, 5, 10, 20, 40, 80],
           'randomforestclassifier__min_samples_split': np.linspace(0.01, 1.0, 100, endpoint=True),
           'randomforestclassifier__min_samples_leaf': np.linspace(0.01, 0.5, 100, endpoint=True),
           'randomforestclassifier__max_leaf_nodes': [None, 10, 50, 100, 200, 400],
           'randomforestclassifier__max_features': [None, 'auto', 'sqrt', 'log2'],
       }},
'DT': {'clf': DecisionTreeClassifier(),
       'param': {
           'decisiontreeclassifier__splitter': ['random', 'best'],
           'decisiontreeclassifier__criterion': ['gini', 'entropy'],
           'decisiontreeclassifier__class_weight': [None, 'balanced'],
           'decisiontreeclassifier__max_depth': [None, 5, 10, 20, 40, 80],
           'decisiontreeclassifier__min_samples_split': np.linspace(0.01, 1.0, 100, endpoint=True),
           'decisiontreeclassifier__min_samples_leaf': np.linspace(0.01, 0.5, 100, endpoint=True),
           'decisiontreeclassifier__max_leaf_nodes': [None, 10, 50, 100, 150, 200],
           'decisiontreeclassifier__max_features': [None, 'auto', 'sqrt', 'log2'],
       }},
# Partial code to show the random search in 5-fold cross-validation
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV

pipe = make_pipeline(preprocessor, control_panel['ml_algo_param_grid'][1]['clf'])
# Note: scoring must be a scorer name (or a make_scorer object), not the bare
# metric function -- passing f1_score directly fails because its signature is
# (y_true, y_pred), not (estimator, X, y).
grid = RandomizedSearchCV(pipe, param_distributions=control_panel['ml_algo_param_grid'][1]['param'],
                          n_jobs=-1, cv=5, scoring='f1_macro')
grid.fit(X_dev, y_dev)
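For reference, here is a self-contained toy run of `RandomizedSearchCV` with a string scorer (`'f1_macro'`), which is the form the `scoring` argument expects; the synthetic data and the single-parameter search space are illustrative only, not the real setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 4-class data standing in for the real name features
X, y = make_classification(n_samples=200, n_classes=4, n_informative=6, random_state=0)

# Random search over C only, scored with macro-averaged F1
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={'C': np.logspace(-2, 2, 10)},
    n_iter=5, cv=3, scoring='f1_macro', random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```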