XGBoost Compared to Other Ensemble Methods Example

Question

Scikit-learn has an example that compares different "ensembles of trees" methods for classification on slices of the iris dataset. Being new to machine learning and having seen XGBoost pop up everywhere, I decided to expand this example and include both scikit-learn's GradientBoostingClassifier and XGBClassifier for comparison. The code is below (note that aside from adding two additional models, it is taken directly from the example linked above):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Parameters
n_classes = 3
n_estimators = 30
RANDOM_SEED = 13  # fix the seed on each iteration

# Load data
iris = load_iris()

models = [DecisionTreeClassifier(max_depth=None),
          RandomForestClassifier(n_estimators=n_estimators),
          ExtraTreesClassifier(n_estimators=n_estimators),
          AdaBoostClassifier(DecisionTreeClassifier(max_depth=None),
                             n_estimators=n_estimators),
          GradientBoostingClassifier(n_estimators=n_estimators, max_depth=None,
                                     learning_rate=0.1),
          XGBClassifier(n_estimators=n_estimators, max_depth=10, eta=0.1)]

for pair in ([0, 1], [0, 2], [2, 3]):
    for model in models:
        # We only take the two corresponding features
        X = iris.data[:, pair]
        y = iris.target

        # Shuffle
        idx = np.arange(X.shape[0])
        np.random.seed(RANDOM_SEED)
        np.random.shuffle(idx)
        X = X[idx]
        y = y[idx]

        # Standardize
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        X = (X - mean) / std

        # Train
        model.fit(X, y)

        scores = model.score(X, y)
        # Create a title for each column and the console by using str() and
        # slicing away useless parts of the string
        model_title = str(type(model)).split(
            ".")[-1][:-2][:-len("Classifier")]

        model_details = model_title
        if hasattr(model, "estimators_"):
            model_details += " with {} estimators".format(
                len(model.estimators_))
        print(model_details + " with features", pair,
              "has a score of", scores)

The results are

DecisionTree with 30 estimators with features [0, 1] has a score of 0.9266666666666666
RandomForest with 30 estimators with features [0, 1] has a score of 0.9266666666666666
ExtraTrees with 30 estimators with features [0, 1] has a score of 0.9266666666666666
AdaBoost with 30 estimators with features [0, 1] has a score of 0.9266666666666666
GradientBoosting with 30 estimators with features [0, 1] has a score of 0.9266666666666666
XGB with 30 estimators with features [0, 1] has a score of 0.8933333333333333
===
DecisionTree with 30 estimators with features [0, 2] has a score of 0.9933333333333333
RandomForest with 30 estimators with features [0, 2] has a score of 0.9933333333333333
ExtraTrees with 30 estimators with features [0, 2] has a score of 0.9933333333333333
AdaBoost with 30 estimators with features [0, 2] has a score of 0.9933333333333333
GradientBoosting with 30 estimators with features [0, 2] has a score of 0.9933333333333333
XGB with 30 estimators with features [0, 2] has a score of 0.9733333333333334
===
DecisionTree with 30 estimators with features [2, 3] has a score of 0.9933333333333333
RandomForest with 30 estimators with features [2, 3] has a score of 0.9933333333333333
ExtraTrees with 30 estimators with features [2, 3] has a score of 0.9933333333333333
AdaBoost with 30 estimators with features [2, 3] has a score of 0.9933333333333333
GradientBoosting with 30 estimators with features [2, 3] has a score of 0.9933333333333333
XGB with 30 estimators with features [2, 3] has a score of 0.9866666666666667

As you can see, the other methods all report the same results, with XGBoost being slightly lower. I obviously have not done any sort of model optimization, but I am wondering if there is a reason why XGBoost does not perform as well in this simple situation. Is it too artificial an example for the benefits of XGBoost to become apparent? Did I set things up in a manner that would disadvantage XGBoost (this is my first time using any of these algorithms)? Thanks in advance!

I see that you made a list of models and then you got a list of scores. But without information about (1) what the scores are and (2) how the models were estimated and (3) what the scores measure (are they in-sample? out-of-sample?) it is not possible to answer this question. But it's well-known that these models are sensitive to the hyper-parameter configurations, so there's no reason to believe that these hyper-parameters will yield good models. Finding that an `xgboost` model without tuning is worse than another model without tuning is kind of like flipping coins and counting the # of heads — Sycorax, Nov 02 '20 at 17:26
I modified a single line of scikit-learn's code (that was linked to) and didn't feel the need to repost all of their code, but have done so now. Evidently, the answer I was looking for is "this is not a good exercise without hyper-parameter tuning," so thanks for that, at least. — HeorotsHero, Nov 02 '20 at 17:50
It's best to reproduce all the relevant content to your question because link content can change, or links can die. It also helps you get useful replies, because digging through links to find the part pertaining to the question is time-consuming and deters answerers. — Sycorax, Nov 02 '20 at 17:51
I have heard that the boosters need aggressive tuning/polish/cv to get world-class performance. Out of the box isn't going to give good. Also, my personal defaults are 1500 trees and a learning rate of 0.001. It is, in my personal opinion, misinformation on the part of competitors that 30 trees in a GBM is a good idea. — EngrStudent, Nov 02 '20 at 17:59
Thanks, @EngrStudent. Your experience regarding defaults for a GBM is helpful and some of the insight I was looking for. Upping the number of trees in line with your experience did even out the results of the models, which is what I was expecting for such a simple problem. I'd select this as my answer if I could (if you repost, I'd do so as well). — HeorotsHero, Nov 02 '20 at 20:01
I feel like all of these answers are ridiculously overblown. The problem is that `n_estimators` performs a different role in the different algorithms, despite having the same name. As you noticed, xgboost performs better if you increase the number of trees. The answers to your questions are "yes" and "yes"! — Flounderer, Nov 04 '20 at 08:04
@Flounderer - I had fun. It was good for my soul. I've wanted to do it for some time. At some point I would like to regularly create answers at the quality level of Walt Flom or the other greats, and I can only get there by practicing. — EngrStudent, Nov 04 '20 at 13:21

Sycorax · Answer 1 · 2020-11-02T21:39:06.073

These models -- random forest, xgboost, etc -- are extremely sensitive to the hyper-parameter configurations, so there's no reason to believe that these hyper-parameters will yield good models. For xgboost, the number of trees and the learning rate are two examples of hyper-parameters which require tuning. Both have a strong effect on the model.

Also, your score measurements are only applied to the in-sample data (the data used to train the model). Because any model can overfit or underfit the training data, it's important to measure performance against a hold-out.
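As a minimal sketch of what a hold-out measurement could look like (this is not code from the question; the 30% split size, the stratification, and the choice of `GradientBoostingClassifier` are assumptions made for illustration), one can split the iris data once and report both scores side by side:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows (an arbitrary choice for this sketch);
# stratify so each class keeps its proportion in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=13)

model = GradientBoostingClassifier(n_estimators=30, learning_rate=0.1,
                                   random_state=13)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # in-sample accuracy
test_acc = model.score(X_test, y_test)     # hold-out accuracy
print("in-sample:", train_acc, "hold-out:", test_acc)
```

The gap between the two numbers is the part that the in-sample score in the question cannot show.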

If I recall correctly, the score method for all of these models implements accuracy, which is not the best choice of measurement for a classification model. See: Why is accuracy not the best measure for assessing classification models?
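For illustration only, here is a sketch of scoring the same model with accuracy and with log loss (one common alternative that uses the predicted probabilities). The choice of log loss, the 5-fold cross-validation, and the random forest settings are assumptions for this example, not something prescribed above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=13)

# Accuracy only looks at the predicted label.
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy")

# Log loss looks at the predicted class probabilities; scikit-learn
# reports it negated (higher is better), so flip the sign back.
nll = -cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")

print("accuracy per fold:", acc)
print("log loss per fold:", nll)
```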

Also, it's not clear what you wish to achieve by limiting consideration to only 2 features. The procedure used here is not a great way to test inclusion or exclusion of features; for more information about feature selection, see feature-selection.

EngrStudent · Accepted Answer · 2020-11-04T03:13:07.763

@Sycorax is very capable, so he is technically quite correct. This answer is more of an elaboration of a comment that supports his main assertions.

Disclaimer: This is a very weak "tuning" so while it shows the concept it is nowhere near optimal, and will pretty strongly over-estimate the number of trees you need.

I have long thought that the Gradient Boosted Machine (GBM) settings one is exposed to in simple searches and introductions to machine learning are easy to show but generalize to practice quite poorly. Evidence of this is that you are using 30 estimators and a learning rate of 0.1, and applying them to the classic toy "Iris" dataset to compare tree-based learners against each other.

Motivations:

Random Forest needs at least 50 trees to converge, and sometimes up to 250. It is much more robust than GBM, so GBM should require many more trees, not many fewer. I would start exploring at 5x, and maybe go up to 35x more trees for a gbm than for a random forest.
GBM is supposed to beat other, much simpler learners. In doing that several times the only combinations of the control parameters that worked were high tree count and low learning rate.
GBM is supposed to handle areas of high slope in the surface it's representing with less discontinuity, which requires more steps of smaller size. This requires either more depth per tree, or more trees. It also requires a small step-size between the discretized regions, which means a low learning rate.

I respect and admire the work of Hadley Wickham. Let's use a learner, input x and y coordinates, and estimate grayscale Hadley. This is a decent exercise because humans are engineered to look at faces. The micro-expression detection and gaze orientation detection that humans can determine from other humans is amazing.

(Aside) One of my problems with random "forests" is that if you only need 100-200 trees then it is really a grove. A biological (tropical/temperate/boreal) forest can have (and need) 20k trees, and you can walk for miles and see great diversity in trees. We are calling it a forest, but it's a grove.

So let's do the basics: make a list of x, y and grayscale intensities, and see what a random forest does in reproducing it. I updated to 'h2o.ai' and used 200 trees, 2 folds. H2O.ai allows a consistent framework for a side-by-side comparison of RandomForest vs. GBM.

If we want to see it in action we need several things including imperfect inputs i.e. noise, and more input columns. The data gets augmented by centering the x and y pixels, and then converting from cartesian to polar, and adding some small gaussian-distributed noise.
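The augmentation described above can be sketched roughly as follows. This is not the original code: the image size, the noise level `noise_sd`, and the choice to add noise to every column are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width = 64, 48  # hypothetical image size, not the real picture

# One row per pixel: x is the column index, y is the row index.
rows, cols = np.mgrid[0:height, 0:width]
x = cols.ravel().astype(float)
y = rows.ravel().astype(float)

# Center the pixel coordinates.
xc = x - x.mean()
yc = y - y.mean()

# Convert the centered cartesian coordinates to polar.
r = np.hypot(xc, yc)
theta = np.arctan2(yc, xc)

# Stack into extra input columns and add small Gaussian noise.
noise_sd = 0.01  # assumed value; the original noise level is not stated
features = np.column_stack([xc, yc, r, theta])
features = features + rng.normal(0.0, noise_sd, size=features.shape)

print(features.shape)
```

The grayscale intensity of each pixel would then be the regression target for these four columns.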

We have our own Hadley-grove, or forest if you must call it that. You can observe that it averages, blurs. Fine detail like the shine of his eyes, or non-axis aligned edges of his hair or collar are lost. The CART, the base learner, is axis-aligned, so it takes more samples to do a diagonal than a horizontal. For the error, darker means more error. The mean absolute error on the holdout is 5.3%.
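The point about axis-aligned splits can be illustrated with a toy example. It is a sketch on a synthetic grid and uses a scikit-learn classification tree rather than the H2O regression setup above: a horizontal edge needs a single split, while a diagonal edge has to be approximated by a staircase of many splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# A 20 x 20 grid of points on the unit square (synthetic data).
g = np.linspace(0.0, 1.0, 20)
xx, yy = np.meshgrid(g, g)
X = np.column_stack([xx.ravel(), yy.ravel()])

# Two binary "images": one with a horizontal edge, one with a diagonal edge.
horizontal = (X[:, 1] > 0.5).astype(int)
diagonal = (X[:, 1] > X[:, 0]).astype(int)

# Grow each tree until it reproduces its training labels exactly.
leaves_h = DecisionTreeClassifier(random_state=0).fit(X, horizontal).get_n_leaves()
leaves_d = DecisionTreeClassifier(random_state=0).fit(X, diagonal).get_n_leaves()

print("leaves for horizontal edge:", leaves_h)
print("leaves for diagonal edge:", leaves_d)
```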

So using the same settings and data, but with the default of 30 estimators, let's see what we get with a GBM that has a learning rate of 0.1.

It is slightly worse. It's not only not stunning, it isn't very competitive. So let's take the hobbles off the learners, and go more all-out. The ideal fit is going to have salt-and-pepper only error, nothing the eyes determine as structural. If you can see a facial feature in the error, then the model isn't capturing it.

Here is what 1000 trees in each gives:

The random forest is crushing it: its mean absolute error is meaningfully less than that of the GBM. Hadley is not a Minecraft block-person, not tailored to the random-forest learner, so what is going on? It is actually a problem slightly more tailored for averaging like you get in an RF, but we aren't saying that too loudly.

Also, this is where the "tuning" comes in. Yes it needs tuning, so if I put in default values it shouldn't work so well. You can see it not working so well.

Here is what a sweep of learning rate at 200 trees gets us. Remember that smaller stepsize is to the left. This has a clear minimum, a best place, between -1.0 and -0.5 on the x-axis. The better stepsize is perhaps 0.2. It is not exceeding the random forest.

Here is what (a relatively limited) grid search on number of trees and learning rate gets us:

It is pretty clear that for higher numbers of learners there is a clear trough, and that the minimum error level tends to go down as the number of learners goes up.
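A grid search of this kind could be sketched as below. This is not the H2O code used for the plots: it substitutes scikit-learn's `GradientBoostingRegressor` and `GridSearchCV`, runs on a small synthetic surface instead of the image, and uses a deliberately tiny grid so that it finishes quickly. The grid values are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic 2-D surface standing in for the (x, y) -> grayscale problem.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = np.sin(3.0 * X[:, 0]) * np.cos(3.0 * X[:, 1]) + rng.normal(0.0, 0.05, 400)

# Number of trees and learning rate are searched jointly.
param_grid = {
    "n_estimators": [40, 200],
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # negated MAE, higher is better
    cv=KFold(n_splits=2, shuffle=True, random_state=0),  # 2 folds, as above
)
search.fit(X, y)

# Print the cross-validated MAE for every grid point, then the best one.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, "MAE:", round(-score, 4))
print("best:", search.best_params_)
```

Plotting the printed MAE against the log of the learning rate, one curve per tree count, gives the kind of trough-shaped picture described above.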

So looking at the data gives me this table:

So, for Hadley, each 5x increase in learners reduces error by a decreasing but consistently non-zero amount. This is why I like multiple ways of attacking the problem: there is noise in the process, so the numeric "minimum" isn't necessarily the true general minimum. When you look at the plot of error vs. learning rate for the 5k size GBM, you can see that values of $10^{-2.5}$ and $10^{-0.9}$ are within the bands for the same level of error. That is ~1.5 decades of "might be the same" which is also "the treasure might be here somewhere" where treasure is the spot you seek.

It is far too few samples, but here is a barely plausible chart suggesting that it is an exponential decay.

That suggests, maybe, that there is a point of diminishing returns, but you can figure out how far you can get from an ideal with some experimentation and algebra. You might also estimate the error with infinite samples.
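One way to do that algebra, assuming the decay really is geometric, is the following sketch. If three errors measured at equally spaced steps (for example, after each 5x increase in learners) follow $e_k = e_\infty + a r^k$, then $(e_2 - e_\infty)^2 = (e_1 - e_\infty)(e_3 - e_\infty)$, which gives $e_\infty = (e_1 e_3 - e_2^2) / (e_1 + e_3 - 2 e_2)$. The function name and the numbers below are hypothetical; the values are synthetic, not the ones from the table.

```python
def asymptotic_error(e1, e2, e3):
    """Estimate the limiting error from three equally spaced measurements,
    assuming the error decays geometrically toward a floor."""
    denom = e1 + e3 - 2.0 * e2
    if denom == 0:
        raise ValueError("errors are linear in the step; no finite floor")
    return (e1 * e3 - e2 * e2) / denom


# Synthetic check: build errors from a known floor and recover it.
floor, a, r = 2.0, 3.0, 0.5
errors = [floor + a * r**k for k in range(3)]
estimate = asymptotic_error(*errors)
print(errors, "->", estimate)
```

With only three noisy points the estimate is fragile, so it is better treated as a rough guide to how much improvement might be left than as a prediction.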

Things to remember:

Consistently outperforming the next guy by 1%, especially when you are at the "last mile" in machine learning and the previous guy is 98.5% accurate, might not look big, but it is a lot.
These learners are used in places other than production such as in teasing out the "physics" aka "mechanics" aka "mechanisms" aka "phenomenology" of the phenomena of interest, and after you understand it, you can make a much (much!!) simpler system to do the same job.
Dials not yet touched include CART controls (leaves per tip, max depth, ...), and some advanced ensemble controls (rates of columns dropout, rates of row dropout, ...). You should consider these when doing your grid search.

Coming soon.

Next steps (to-do, sorry I'm out of time)

Maybe share something novel about gbm's.. (or not)
