Scikit-learn has an example where it compares different "ensembles of trees" methods for classification on slices of their iris dataset. Being new to machine learning and having seen XGBoost pop everywhere, I decided to expand this example and include both scikit-learn's GradientBoostingClassifier
and XGBClassifier
for comparison. The code is (note that aside from adding two additional models, this code is taken directly from the example linked above)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
AdaBoostClassifier,GradientBoostingClassifier)
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
# Parameters
n_classes = 3
n_estimators = 30
RANDOM_SEED = 13 # fix the seed on each iteration
# Load data
iris = load_iris()
models = [DecisionTreeClassifier(max_depth=None),
RandomForestClassifier(n_estimators=n_estimators),
ExtraTreesClassifier(n_estimators=n_estimators),
AdaBoostClassifier(DecisionTreeClassifier(max_depth=None),
n_estimators=n_estimators),
GradientBoostingClassifier( n_estimators=n_estimators, max_depth=None, learning_rate=0.1),
XGBClassifier( n_estimators=n_estimators, max_depth=10, eta=0.1)]
for pair in ([0, 1], [0, 2], [2, 3]):
for model in models:
# We only take the two corresponding features
X = iris.data[:, pair]
y = iris.target
# Shuffle
idx = np.arange(X.shape[0])
np.random.seed(RANDOM_SEED)
np.random.shuffle(idx)
X = X[idx]
y = y[idx]
# Standardize
mean = X.mean(axis=0)
std = X.std(axis=0)
X = (X - mean) / std
# Train
model.fit(X, y)
scores = model.score(X, y)
# Create a title for each column and the console by using str() and
# slicing away useless parts of the string
model_title = str(type(model)).split(
".")[-1][:-2][:-len("Classifier")]
model_details = model_title
if hasattr(model, "estimators_"):
model_details += " with {} estimators".format(
len(model.estimators_))
print(model_details + " with features", pair,
"has a score of", scores)
The results are
DecisionTree with 30 estimators with features [0, 1] has a score of 0.9266666666666666
RandomForest with 30 estimators with features [0, 1] has a score of 0.9266666666666666
ExtraTrees with 30 estimators with features [0, 1] has a score of 0.9266666666666666
AdaBoost with 30 estimators with features [0, 1] has a score of 0.9266666666666666
GradientBoosting with 30 estimators with features [0, 1] has a score of 0.9266666666666666
XGB with 30 estimators with features [0, 1] has a score of 0.8933333333333333
===
DecisionTree with 30 estimators with features [0, 2] has a score of 0.9933333333333333
RandomForest with 30 estimators with features [0, 2] has a score of 0.9933333333333333
ExtraTrees with 30 estimators with features [0, 2] has a score of 0.9933333333333333
AdaBoost with 30 estimators with features [0, 2] has a score of 0.9933333333333333
GradientBoosting with 30 estimators with features [0, 2] has a score of 0.9933333333333333
XGB with 30 estimators with features [0, 2] has a score of 0.9733333333333334
===
DecisionTree with 30 estimators with features [2, 3] has a score of 0.9933333333333333
RandomForest with 30 estimators with features [2, 3] has a score of 0.9933333333333333
ExtraTrees with 30 estimators with features [2, 3] has a score of 0.9933333333333333
AdaBoost with 30 estimators with features [2, 3] has a score of 0.9933333333333333
GradientBoosting with 30 estimators with features [2, 3] has a score of 0.9933333333333333
XGB with 30 estimators with features [2, 3] has a score of 0.9866666666666667
As you can see, the other methods all report the same results with XGBoost being slightly lower. I obviously have not done any sort of model optimization, but I am wondering if there is a reason why XGBoost does not perform as well in this simple situation? Is it too artificial of an example for the benefits of XGBoost to become apparent? Did I set things up in a manner that would disadvantage XBGoost (this is my first time using any of these algorithms)? Thanks in advance!