My prediction task is as follows:
Use a person's name to predict their ethnicity (four categories: "English", "French", "Chinese", and "All others") as a multiclass classification problem. During feature engineering, the name variable is broken down into 3-letter and 4-letter substrings. For example, the name "Robert" is broken down into "rob", "obe", "ber", "ert", "robe", "ober", and "bert".
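For concreteness, here is a minimal sketch of that substring extraction (the function name `char_ngrams` is my own; the post does not show the actual feature-engineering code):

```python
def char_ngrams(name, sizes=(3, 4)):
    """Return all lowercase character n-grams of the given sizes."""
    s = name.lower()
    return [s[i:i + n] for n in sizes for i in range(len(s) - n + 1)]

char_ngrams("Robert")
# ['rob', 'obe', 'ber', 'ert', 'robe', 'ober', 'bert']
```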
I have about 5 million rows, each representing a single person. I split the data into 80:20 training:test sets. Within the training set, I carved out 10% as a development set, on which I run 5-fold cross-validation to find the optimal hyperparameter set for each ML algorithm. The algorithms I used are regularized logistic regression (LR), linear SVC, non-linear SVC, decision trees (DT), and random forest (RF).
I took the hyperparameter set that gives the best F-score in 5-fold cross-validation, used it to train the model on the entire training set, and then evaluated the trained model's performance on the test set.
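To make the splitting scheme concrete, a minimal sketch with toy stand-ins (the real `X`, `y` would be the engineered substring features and the 4-class labels; whether the real splits were stratified is not stated, so I leave stratification out):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real features and labels
X = [[i] for i in range(100)]
y = [i % 4 for i in range(100)]

# 80:20 train:test split, then carve 10% of the training set out as the dev set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_rest, X_dev, y_rest, y_dev = train_test_split(X_train, y_train, test_size=0.1, random_state=0)

print(len(X_train), len(X_test), len(X_dev))  # 80 20 8
```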
The strange phenomenon I am seeing is that LR and linear SVC reach accuracy and F1-scores in the high 80s, while DT and RF perform extremely poorly at about 50%, roughly the same as benchmark dummy predictors that have no predictive value.
I understand that some ML algorithms perform better than others in different problem spaces (the no-free-lunch theorem), and the fact that LR and linear SVC outperform non-linear SVC, which in turn vastly outperforms DT and RF, suggests the problem is linearly separable but the boundaries are not neatly parallel to the feature axes.
However, the fact that DT/RF were incapable of learning anything at all (performing like random predictors) is shocking to me, especially with this much data and with LR/linear SVC doing quite well. I also evaluated the trained models on the training set itself: DT/RF learnt very little or nothing even there, which I found strange, because I had initially suspected the poor test-set performance was due to overfitting on the training set.
Shouldn't DT/RF at least be able to learn something, given how flexible they are? Is this likely to happen in real life (i.e., am I seeing a real, natural phenomenon), or have I likely used DT/RF wrong? The code below shows the hyperparameter space I allow the random search to explore during 5-fold cross-validation.
# Partial code to specify hyperparameter space to be searched
'LR_V1': {'clf': LogisticRegression(),
          'param': {
              'logisticregression__solver': ['liblinear'],
              'logisticregression__penalty': ['l1', 'l2'],
              'logisticregression__C': np.logspace(-4, 4, 20),
              'logisticregression__tol': np.logspace(-5, 5, 20),
              'logisticregression__class_weight': [None, 'balanced'],
              'logisticregression__multi_class': ['ovr', 'auto'],
              'logisticregression__max_iter': [50, 1000, 4000, 20000],
          }},
'LR_V2': {'clf': LogisticRegression(),
          'param': {
              'logisticregression__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
              'logisticregression__penalty': ['none', 'l2'],
              'logisticregression__C': np.logspace(-4, 4, 20),
              'logisticregression__tol': np.logspace(-5, 5, 20),
              'logisticregression__class_weight': [None, 'balanced'],
              'logisticregression__multi_class': ['ovr', 'multinomial', 'auto'],
              'logisticregression__max_iter': [50, 1000, 4000, 20000],
          }},
'SVC_LINEAR': {'clf': OneVsRestClassifier(LinearSVC()),
               'param': {
                   'onevsrestclassifier__estimator__penalty': ['l2'],
                   'onevsrestclassifier__estimator__loss': ['hinge', 'squared_hinge'],
                   'onevsrestclassifier__estimator__C': np.logspace(-4, 4, 20),
                   'onevsrestclassifier__estimator__tol': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
                   'onevsrestclassifier__estimator__class_weight': [None, 'balanced'],
                   'onevsrestclassifier__estimator__multi_class': ['ovr', 'crammer_singer'],
                   'onevsrestclassifier__estimator__max_iter': [50, 1000, 4000, 20000],
               }},
'SVC_NONLINEAR': {'clf': OneVsRestClassifier(SVC()),
                  'param': {
                      'onevsrestclassifier__estimator__kernel': ['poly', 'rbf', 'sigmoid'],
                      'onevsrestclassifier__estimator__C': np.logspace(-4, 4, 20),
                      'onevsrestclassifier__estimator__tol': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
                      'onevsrestclassifier__estimator__class_weight': [None, 'balanced'],
                      'onevsrestclassifier__estimator__decision_function_shape': ['ovo', 'ovr'],
                      'onevsrestclassifier__estimator__max_iter': [50, 1000, 4000, 20000],
                  }},
'RF': {'clf': RandomForestClassifier(),
       'param': {
           'randomforestclassifier__n_estimators': [1, 8, 16, 32, 64, 100, 200, 500, 1000],
           'randomforestclassifier__criterion': ['gini', 'entropy'],
           'randomforestclassifier__class_weight': [None, 'balanced', 'balanced_subsample'],
           'randomforestclassifier__max_depth': [None, 5, 10, 20, 40, 80],
           'randomforestclassifier__min_samples_split': np.linspace(0.01, 1.0, 100, endpoint=True),
           'randomforestclassifier__min_samples_leaf': np.linspace(0.01, 0.5, 100, endpoint=True),
           'randomforestclassifier__max_leaf_nodes': [None, 10, 50, 100, 200, 400],
           'randomforestclassifier__max_features': [None, 'auto', 'sqrt', 'log2'],
       }},
'DT': {'clf': DecisionTreeClassifier(),
       'param': {
           'decisiontreeclassifier__splitter': ['random', 'best'],
           'decisiontreeclassifier__criterion': ['gini', 'entropy'],
           'decisiontreeclassifier__class_weight': [None, 'balanced'],
           'decisiontreeclassifier__max_depth': [None, 5, 10, 20, 40, 80],
           'decisiontreeclassifier__min_samples_split': np.linspace(0.01, 1.0, 100, endpoint=True),
           'decisiontreeclassifier__min_samples_leaf': np.linspace(0.01, 0.5, 100, endpoint=True),
           'decisiontreeclassifier__max_leaf_nodes': [None, 10, 50, 100, 150, 200],
           'decisiontreeclassifier__max_features': [None, 'auto', 'sqrt', 'log2'],
       }},
# Partial code to show the random search in 5-fold cross-validation
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV

pipe = make_pipeline(preprocessor, control_panel['ml_algo_param_grid'][1]['clf'])
# Note: scoring must be a scorer name (or a make_scorer object), not the bare
# metric function -- passing f1_score directly fails because its signature is
# (y_true, y_pred), not (estimator, X, y).
grid = RandomizedSearchCV(pipe, param_distributions=control_panel['ml_algo_param_grid'][1]['param'],
                          n_jobs=-1, cv=5, scoring='f1_macro')
grid.fit(X_dev, y_dev)
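For reference, here is a self-contained toy run of `RandomizedSearchCV` with a string scorer (`'f1_macro'`), which is the form the `scoring` argument expects; the synthetic data and the single-parameter search space are illustrative only, not the real setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 4-class data standing in for the real name features
X, y = make_classification(n_samples=200, n_classes=4, n_informative=6, random_state=0)

# Random search over C only, scored with macro-averaged F1
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={'C': np.logspace(-2, 2, 10)},
    n_iter=5, cv=3, scoring='f1_macro', random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```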