
I have a dataset of around 10,000 rows and 500 features; the response variable is binary, so this is a binary classification task.

I split the features into 5 equally sized groups (based on subject-matter expertise) and trained 3 different models (RandomForest, XGBoost, SVM) on each group.

So now I have 15 different models. I then trained a single (RandomForest) model on the outputs of those 15 (the probability outputs, not hard predictions).
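
For concreteness, the setup looks roughly like this (a sketch rather than my exact code; it assumes scikit-learn-style estimators, a feature matrix `X`, a target `y`, the 5 expert-defined column groups in `feature_groups`, and a CV splitter that covers every row exactly once as a test point):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from xgboost import XGBClassifier

def build_meta_features(X, y, feature_groups, cv):
    """Collect out-of-fold probability outputs from the 15 base models
    (3 model types x 5 feature groups) to use as meta-features."""
    base_models = [
        RandomForestClassifier(n_estimators=500, random_state=0),
        XGBClassifier(n_estimators=500, eval_metric="logloss"),
        SVC(probability=True),
    ]
    meta_cols = []
    for cols in feature_groups:          # 5 expert-defined groups of columns
        for model in base_models:        # 3 model types per group
            # Out-of-fold P(y=1), so the meta-learner never sees in-sample fits.
            proba = cross_val_predict(model, X[:, cols], y, cv=cv,
                                      method="predict_proba")[:, 1]
            meta_cols.append(proba)
    return np.column_stack(meta_cols)    # shape (n_rows, 15)

# meta_X = build_meta_features(X, y, feature_groups, cv=time_aware_cv)
# meta_model = RandomForestClassifier().fit(meta_X, y)
```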

The results were surprising to me: the meta-learner (the single 2nd-level RF model) did no better than the average individual "base" learner. In fact, some base-level models did better than the meta-learner. I'm wondering why this could be.

I'm not very experienced with stacking, so I thought perhaps there are some general tips/strategies/techniques that I'm missing.

General Notes:

  1. Each of the base models does significantly better than baseline.
  2. I give the meta-learner access to some of the base features plus the predictions (probability outputs) of the base learners. I don't give the meta-learner ALL base features because I've found that even base learners given all 500 features don't perform better than base learners given 100 of the 500, so I don't think the algorithms can handle all the features at once (maybe there aren't enough rows for a single learner to learn all the relationships between all features?).
  3. I looked at the inter-model agreement among the base learners. As mentioned, it is a binary classification task, and each base learner achieves 65-70% accuracy (baseline is around 50%). I then looked at how much overlap there is between base models in which predictions they get right and wrong (presumably, if they all get the same predictions right, there's nothing for the meta-learner to combine, since their outputs carry the same information). The base models generally get the same examples right and wrong 70-80% of the time. I don't know whether this is high or low, but perhaps it is part of the key. If so, please explain in detail, as the remaining 20-30% disagreement still seems like a good amount of potential improvement to exploit in combination (unless that's just random noise).
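
For reference, the agreement numbers above were computed along these lines (a sketch; `oof_preds` is a hypothetical array of out-of-fold hard 0/1 predictions, one column per base model — for a binary target, two models agree on correctness for a row exactly when they make the same prediction there):

```python
import numpy as np

def pairwise_agreement(oof_preds):
    """Fraction of rows on which each pair of base models makes the same call.
    oof_preds: (n_rows, n_models) array of hard 0/1 predictions."""
    n_models = oof_preds.shape[1]
    agree = np.ones((n_models, n_models))
    for i in range(n_models):
        for j in range(i + 1, n_models):
            frac = np.mean(oof_preds[:, i] == oof_preds[:, j])
            agree[i, j] = agree[j, i] = frac
    return agree

# The off-diagonal entries are the 70-80% overlap figures mentioned above.
```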

Info about my train/test splitting:

  1. I am using nested cross validation to generate the level 1 predictions.
  2. I am also using nested cross validation to train, test and evaluate the level 2 model.
  3. I am working with time series data, so I organized my cross-validation to avoid using "future" training data with "past" validation data. My inner CV folds follow the same structure as the outer ones. [Diagram of the outer nested CV folds omitted.]
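
Since the diagram isn't reproduced here, a rough stand-in for the fold layout (scikit-learn's `TimeSeriesSplit` is used purely as an illustration of the "no future training data for past validation" idea; my actual folds differ in the details):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Each outer fold trains on rows up to a cut-off and validates on the block
# that follows, so no "future" rows are used to predict the "past".
outer_cv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(outer_cv.split(np.arange(10_000))):
    print(f"fold {fold}: train rows 0-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}")
```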
Vladimir Belik
  • For one, I suggest simplifying your model initially and using not 15 but only 5 weak base learners (e.g. XGBoost only), then combining them. Naively, I also would not want to rule out simple programming errors. Finally, what I am missing in your question is information about the train/test/evaluation data split. Are you doing cross-validation? – Nikolas Rieble Jan 23 '22 at 19:45
  • @NikolasRieble How would you recommend combining the base learners, and why would you use fewer models? I am using an RF model as the second-level learner, and while I can try it out, I struggle to see how decreasing the number of base learners would help. Do you disagree? What kinds of programming errors might you expect? I added an edit to my original question, but yes, I am using nested cross-validation to generate the level-1 predictions. – Vladimir Belik Jan 23 '22 at 19:52
  • You could try a variant where the ensemble algorithm has access to all base features plus the predictions of the 1st-level algorithms. I would use fewer models because whenever I do not understand something, I try to reduce complexity. Programming errors could be related to the train/test split, for example. I wonder which data you use for training the ensemble algorithm. – Nikolas Rieble Jan 23 '22 at 19:56
  • @NikolasRieble I do give the ensemble algorithm access to some base features + predictions. The reason I don't give ALL base features is because I found that even level 1 models trained on ALL 500 features don't perform better than level 1 models trained on 100/500 of the features. I think the algorithms can't handle all variables at once (maybe not enough rows?). I'll give reducing complexity a try, thank you for your suggestion. What do you mean by "which data you use for training the ensemble algorithm"? – Vladimir Belik Jan 23 '22 at 20:03
  • You use nested cross-validation for level 1. And how do you train level 2? You forgot to share information about the level-2 training. – Nikolas Rieble Jan 23 '22 at 20:21
  • @NikolasRieble I apologize for that lack of info. I'm also using nested cross validation to train and test/evaluate the level 2 model. Same procedure, just feeding in level 1 outputs instead of all variables into level 2 model. – Vladimir Belik Jan 23 '22 at 20:33
  • @NikolasRieble I've put a bounty on the question, in case you are interested in responding. – Vladimir Belik Jan 26 '22 at 15:11
  • Are the base predictions hard classifications, or soft (probabilities or other continuous confidence measure)? How exactly is the nested cross-validation being performed? What kind of features does the meta-classifier have access to? Have you tried anything other than a random forest as the meta-estimator? – Ben Reiniger Jan 26 '22 at 18:48
  • @BenReiniger Base predictions are probabilities (from RF, XGB and SVM). Nested cross-validation is performed in a way slightly adapted to time series data (as it is a time series problem), but it is basically standard nested CV: 5 outer folds, each with 5 inner folds. Since it is time series, I just order the folds in a way that somewhat preserves chronological order. As I mentioned in my question, the meta-classifier has access to the base model predictions plus a couple dozen base features I thought might be helpful. I have tried XGB as the meta-estimator as well, with no luck. – Vladimir Belik Jan 26 '22 at 19:05
  • @BenReiniger I edited my question with a diagram to better convey my CV process. – Vladimir Belik Jan 26 '22 at 19:09
  • I haven't read your specific problem, but I saw a lot of similar questions in the past days. You should check out the `stacking` tag that you used. I think there are really interesting discussions in some of those questions. – Marcel Braasch Jan 26 '22 at 19:39
  • @MarcelBraasch Thank you for the suggestion, I'll check it out. – Vladimir Belik Jan 26 '22 at 19:50
  • What is the nested CV though; are you tuning hyperparameters? What becomes of the earliest training examples, which never have predictions made by base models? An early suggestion: try a logistic regression for the meta-estimator. – Ben Reiniger Jan 27 '22 at 04:42
  • @BenReiniger Yes, I am tuning hyperparameters and running feature selection within the nested CV. I do it at the base level to get out-of-sample predictions (to feed to the 2nd-level model), and at the 2nd level to get an accurate error estimate. Great question. Here's what I'm doing: if you refer to the diagram I provided, my earliest "test set" point for the base models is at row 4,500 of 10,000. However, I don't want to leave the first 4,500 rows without predictions (as you mention). So I split the first 4,500 into 2 folds and run CV there (explained in the next comment). – Vladimir Belik Jan 27 '22 at 04:50
  • @BenReiniger So to be clear, for the first 4,500 rows I split them in half, run CV on the first half and use the second half as a "test set" to generate out-of-sample base predictions for those points, then repeat the procedure vice versa, using the 2nd half for training/validation and the 1st half as "test", so that I have OOS predictions there too (a code sketch of this split appears after these comments). I thought that way I would have "OOS" predictions for the entire dataset, so the 2nd-level model can use all that data. I haven't tried LR as the meta-learner, I'll give it a shot. My concern with LR was that I thought 15 variables might be too much for it. – Vladimir Belik Jan 27 '22 at 04:51
  • @BenReiniger If you have questions/concerns/issues with what I'm doing with those first 4500 points in terms of this weird split I'm doing, please share. I have suspected that it might be the issue, but I'm not sure because the OOS performance on both halves is above baseline (though there's a noticeable difference - one of the fold performances is better than the other, even taking into account the "expected" performance from CV). It's something like maybe 65% on one, and 58-60% for the other. Not huge, but consistently different. I figured since it's better than baseline, it's not the issue. – Vladimir Belik Jan 27 '22 at 04:55
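
For concreteness, the two-half scheme for the earliest rows described in the comments above looks roughly like this (a sketch; `fit_with_inner_cv` is a placeholder for whatever tuning/feature selection is run inside each half, and is assumed to return a fitted classifier with `predict_proba`):

```python
import numpy as np

def oof_for_early_rows(X_early, y_early, fit_with_inner_cv):
    """Out-of-sample base predictions for the rows that precede the first
    outer test fold: fit on one half, predict the other, then swap."""
    n = len(y_early)
    half = n // 2
    oof = np.empty(n)

    # First half for training/validation, second half as "test".
    model = fit_with_inner_cv(X_early[:half], y_early[:half])
    oof[half:] = model.predict_proba(X_early[half:])[:, 1]

    # Swap: second half for training/validation, first half as "test"
    # (note this direction trains on later rows to predict earlier ones).
    model = fit_with_inner_cv(X_early[half:], y_early[half:])
    oof[:half] = model.predict_proba(X_early[:half])[:, 1]
    return oof
```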

1 Answer


The Kaggle community has produced a lot of useful material and practical recommendations on stacking. These include:

  • Often simple ensembling models (e.g. a simple average of predicted probabilities, a weighted average of predicted probabilities, or a logistic regression on the logits, possibly regularized towards a simple average) perform a lot better than trying to fit fancier models at the second level (a sketch follows this list). This is especially the case when there's little data; with time series data in particular it's hard to know how much data 10,000 rows really is, because of the dependence over time and across different observations at the same time. XGBoost would probably not be my first thought for a stacking model unless there's a lot of data.
  • Giving the ensembling model access to base features can be valuable when there is a lot of data and depending on some features the predictions of one model should be trusted more than the predictions of another model (of course, the latter is hard to know up-front). However, one often has to worry about this making it too easy to overfit and not adding value, so I would consider not doing that.
  • Ensembling models also need good hyperparameter choices and often need to be heavily regularized to avoid overfitting. How much? That's hard to say without a good validation set-up; if you do not have a good validation set-up for them yet, this needs thought. A form of time-wise splits like you seem to be using would be a sensible approach - e.g. you could use what you show in red as the validation data on which you fit your ensembling models, and then validate by going even further into the future (perhaps you are already doing that?).
  • Have a look at the chapter on this in Kaggle GM Abhishek Thakur's book (page 272 onwards) or - if you have easy access - there are excellent sections on ensembling and the validation schemes for it in the "How to Win a Data Science Competition" Coursera course, as well as what various Kagglers have written on ensembling (e.g. here, or simply by looking at the forum discussions and winning solutions posted for a Kaggle competition that resembles your particular problem).
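
As a rough sketch of the first point above (the out-of-fold base probabilities `meta_X` and the target `y` are placeholders for whatever your level-1 step actually produces):

```python
import numpy as np
from scipy.special import logit
from sklearn.linear_model import LogisticRegression

def simple_average(meta_X):
    """Plain average of the base probabilities -- the simplest possible blend."""
    return meta_X.mean(axis=1)

def logistic_blend(meta_X, y):
    """Logistic regression on the logits of the base probabilities,
    with L2 regularization (smaller C = stronger shrinkage)."""
    Z = logit(np.clip(meta_X, 1e-6, 1 - 1e-6))
    return LogisticRegression(C=1.0, max_iter=1000).fit(Z, y)
```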

Why am I emphasizing Kaggle (and similar data science competition settings) so much? Firstly, because a lot of ensembling/stacking gets done there. Secondly, because strong incentives exist there to ensure that ensembling is not overfit and performs well on the unseen test data (while practitioners sometimes fool themselves into believing overfit results that were evaluated in an unreliable manner). Of course, there are also incentives to do things that would not work in practice but might work well in the competition (like exploiting target leaks).

70-80% agreement between models that perform comparably (although a difference of 65% vs. 70% accuracy seems large) actually sounds like a promising scenario for ensembling, in the sense that this is on the low side for models trained on the same data. It reflects the reasonable diversity of models you chose (I'd expect much more similar results if you used, say, XGBoost and LightGBM). Having models that are too similar in nature, e.g. multiple XGBoosts with slightly different hyperparameter values, is usually much less valuable. Perhaps even more diversity could be achieved with additional model types, e.g. a kNN classifier, logistic regression (both might require some good feature engineering) and, depending on the details (which determine whether there's any hope of doing this - e.g. high-cardinality categorical features, some inputs being text/images, being able to feed the time series nicely into an LSTM), neural networks of some form (e.g. LSTM-type).

Björn
  • Thank you for your response. I'll certainly read through the Kaggle content you linked. Overall, though, it seems like the only big actionable suggestions I haven't tried yet are using a simpler level-2 model and adding some kNN and LR into the base models for additional diversity. I will certainly try these things. The reason I didn't want to use LR for the level-2 model is precisely the reason you identified in your 2nd point - I am hoping for dependencies between features and base model performance to be found, which I think LR would not find. But I will try. – Vladimir Belik Jan 31 '22 at 16:30
  • I am confident in my nested CV setup, I think everything is correct there. Broadly speaking, in this kind of situation (a reasonable amount of data, yet lackluster stacking performance), you have outlined several things to try - increase base model diversity even more, try a variety of level-2 models. Let's say I've exhausted those options and the situation is the same. What would you recommend? Is there a point at which you say "stacking just won't help here"? If so, do I just conclude that the base model disagreement is due to noise, and there's no additional performance left on the table? – Vladimir Belik Jan 31 '22 at 16:33
  • Additionally, I'm reading your link and they refer to a model type with an acronym "ET". What does that stand for? – Vladimir Belik Jan 31 '22 at 19:08
  • I'm wondering if the amount of data is effectively less than you hope due to the correlated nature of time series. That may favor more basic models, even if more complicated models might be better with more data. Stacking/ensembling is a tricky business because of the heightened overfitting risk it entails. I really like trying a weighted average. One nice thing about it is that if the optimal weights for all but one model are 0, this gives you a hint that one model might be too much better than the others for ensembling to work. I'm not sure what ET is, possibly ExtraTreesClassifier/Regressor. – Björn Jan 31 '22 at 23:01
  • I've differenced and otherwise set up my data so that there shouldn't be big autocorrelation issues, but I definitely understand your logic. How do you suggest going about a weighted average? Especially a weighted average that will set some weights to 0? Are you talking about LASSO regression? – Vladimir Belik Feb 01 '22 at 00:45
  • Let's say you have N different models; then you basically have N-1 weights to determine (and the final one is 1 minus the sum of the others). Since optimization usually works best on an unconstrained space, you work with values on the -infinity to infinity scale and turn them into weights as w = exp(log_softmax(vector(x, -sum(x)))). We then use sum(w*predictions) in the binary log-loss. You can also penalize (e.g. with an L2 penalty) the weights for deviating from 1/N, or vector(x, -sum(x)) for deviating from each other. This can then be done with standard minimization functions such as ucminf::ucminf() in R (a code sketch follows these comments). – Björn Feb 01 '22 at 14:17
  • Very interesting, I've never tried anything even remotely similar for weighting the models (I always thought I'd need another model to do that). Do you happen to have a link to something like this done in more detail? – Vladimir Belik Feb 01 '22 at 16:34
  • It is a model, just with very particular constraints on the model parameters. I've seen a few examples of this in Kaggle notebooks that people have shared. Except for the regularization bit, the [book chapter I linked](https://github.com/abhishekkrthakur/approachingalmost) discusses weighted averages. With regularization, I've used that a few times, but don't have a great example at hand. Here's [one example where I describe a closely related PyTorch approach for one competition](https://www.kaggle.com/c/ranzcr-clip-catheter-line-classification/discussion/226684) (see ensembling models). – Björn Feb 01 '22 at 17:45
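
For reference, a minimal Python/scipy stand-in for the softmax-weight approach Björn describes above (his comment uses R's `ucminf::ucminf()`; the names `preds`, `oof_probs`, `test_probs` and the `l2` strength here are placeholders):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax
from sklearn.metrics import log_loss

def fit_blend_weights(preds, y, l2=0.0):
    """preds: (n_rows, n_models) out-of-fold base probabilities; y: binary target.
    Returns non-negative weights summing to 1, fit by minimizing log-loss."""
    n_models = preds.shape[1]

    def objective(x):
        # Unconstrained parameters -> weights via softmax(vector(x, -sum(x))).
        w = softmax(np.append(x, -x.sum()))
        blended = preds @ w
        # Optional shrinkage of the weights towards the uniform 1/N blend.
        penalty = l2 * np.sum((w - 1.0 / n_models) ** 2)
        return log_loss(y, blended) + penalty

    res = minimize(objective, x0=np.zeros(n_models - 1), method="BFGS")
    return softmax(np.append(res.x, -res.x.sum()))

# weights = fit_blend_weights(oof_probs, y, l2=0.1)
# final_probs = test_probs @ weights
```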