
Say I commit the following sins while building a predictive model:

  1. I take my dataset and split it into four subsets: three for training (Train_A, Train_B, and Train_C) and one for validation.

  2. I train an initial model (Model_A) on Train_A. Because the goal is to maximize out-of-sample prediction accuracy, I use bias-variance-balancing techniques like cross-validation.

  3. I generate predictions from Model_A on Train_B and record the prediction errors.

  4. Next, I train a second model (Model_B) on Train_B, but I weight the observations based on the magnitude of the prediction errors from Model_A. In other words, Model_B is told to focus most on learning to predict the observations that Model_A was really bad at predicting. Again, the goal is out-of-sample accuracy, so a technique like cross-validation is used.

  5. I generate predictions from Model_A and Model_B on Train_C. These are used to explore the best way to combine the predictions from both models (e.g., a weighted average) to (hopefully) increase out-of-sample prediction accuracy.

  6. After determining the best way to weight the predictions from Model_A and Model_B, I estimate the out-of-sample accuracy using the validation set.
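
In code, the workflow I have in mind looks roughly like this (the scikit-learn estimators, the synthetic data, and the 25/25/25/25 split are just placeholders for whatever I would actually use):

```python
# Rough sketch of the workflow described above; models and data are placeholders.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

# 1. Split into Train_A, Train_B, Train_C, and a validation set (25% each).
X_a, X_rest, y_a, y_rest = train_test_split(X, y, test_size=0.75, random_state=0)
X_b, X_rest, y_b, y_rest = train_test_split(X_rest, y_rest, test_size=2/3, random_state=0)
X_c, X_val, y_c, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 2. Fit Model_A on Train_A (RidgeCV tunes its penalty via internal cross-validation).
model_a = RidgeCV().fit(X_a, y_a)

# 3. Record Model_A's prediction errors on Train_B.
errors_b = np.abs(y_b - model_a.predict(X_b))

# 4. Fit Model_B on Train_B, weighting the observations Model_A predicted worst.
weights = errors_b / errors_b.mean()
model_b = RandomForestRegressor(random_state=0).fit(X_b, y_b, sample_weight=weights)

# 5. On Train_C, search for the best weight for blending the two predictions.
pred_a_c, pred_b_c = model_a.predict(X_c), model_b.predict(X_c)
alphas = np.linspace(0, 1, 21)
mses = [mean_squared_error(y_c, a * pred_a_c + (1 - a) * pred_b_c) for a in alphas]
best_alpha = alphas[int(np.argmin(mses))]

# 6. Estimate out-of-sample accuracy of the blend on the held-out validation set.
blend_val = best_alpha * model_a.predict(X_val) + (1 - best_alpha) * model_b.predict(X_val)
print("validation MSE:", mean_squared_error(y_val, blend_val))
```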

Main Question: Am I damned? Is this approach inherently and irrecoverably prone to overfitting? Or, is there a way to use the errors from Model_A to inform how Model_B is trained in such a way that the strengths of Model_B address the weaknesses of Model_A?

Secondary Questions: Are there particular techniques or algorithms that are better/worse at extracting value from this kind of approach? For example, I wouldn't be surprised if there are some NN techniques that inherently do this kind of thing and, therefore, wouldn't benefit at all from this approach, whereas something less flexible (like regularized regression) could potentially benefit greatly in comparison. What other thoughts or advice would you provide to someone who wishes to take this approach?

Thank you!

[Edit: I feel like I walked into a McMenamins and pitched the idea of a Microbrewery to the bartender, haha! Thanks everyone for your very kind and helpful comments!]

Jdclark
  • See boosting algorithms. They do more or less what you describe. For a typical example, see AdaBoost (https://fr.wikipedia.org/wiki/AdaBoost). – TMat May 05 '21 at 14:59
  • Thanks! My assumption was that I couldn't possibly be the first to think of this. However, I am curious if there are any red flags in the approach I've outlined. It is nice to know that there are algorithms that already do this, but can the approach I described above be manually applied to any algorithm? – Jdclark May 05 '21 at 15:03
  • Geez, I feel pretty silly now after looking into boosting. Turns out the basic answer to my question wasn't easily found here because the answer is so obvious lol. Still, I'm curious if there are any special considerations I should make if I were to take the approach manually. I see there are algorithms that already do this, but do you see any red flags in my THINKING? – Jdclark May 05 '21 at 15:16
  • What you call validation, I would call testing, while what you are doing (taking a model from Train_A, predicting on Train_B, and so on, then adjusting and tuning the model) is what I would call validation. So be aware when you read other people: they may use slightly different language while doing similar things – Henry May 05 '21 at 15:24
  • I encounter enough people using test/validation/hold-out interchangeably that it's easy to get sloppy with these. At least anyone who understands the value of a "test set" will know exactly what I mean if I call it by the wrong name. Still, being less sloppy is always a good thing. Thanks for pointing this out! – Jdclark May 05 '21 at 16:13
  • In addition to all the responses about boosted tree models (such as xgboost, LightGBM, etc.), which admittedly don't work with iterative sets of datasets, there is indeed a neural network that tries to marry boosting on residual errors: TabNet (https://arxiv.org/abs/1908.07442). – Björn May 05 '21 at 16:29
  • Thanks for the link! Can you elaborate (or link to discussion) about what you mean when you say boosted tree models don't work with iterative sets of data? – Jdclark May 05 '21 at 16:30
  • Definitely let @Björn's response outweigh mine, but I believe he means that they use the training data to update the weights at each round, whereas your scheme would use a holdout set to update. So it is just a single dataset used in theirs (I guess technically they create the 'residual' dataset at each iteration, but there is only one true training set). Validation sets are only used to stop the iterations: if the validation-set error begins increasing, boosting is stopped. If we use the holdout set for updates, then we just overfit the holdout. – Tylerr May 05 '21 at 17:15
  • Well said @Tylerr. To be fair, that's the same with xgboost, LightGBM, and similar approaches. Who knows whether splitting your data and iteratively doing things on different parts might not be a good idea? In some sense these boosting approaches do something like that by randomly sub-sampling records (and features) at each iteration, but that's not quite the same. The interesting thing with TabNet is that it's all done in one fitted neural network instead of the iterative fitting of a typical boosted tree model. – Björn May 05 '21 at 21:13

2 Answers


As noted in the comments, you've re-discovered boosting. There is nothing wrong with this approach, but it is usually easier and safer to use a method already implemented and battle-tested by someone else than to start from scratch. If you really want to use your own approach, I'd encourage you to first run some out-of-the-box implementation of boosting (AdaBoost, XGBoost, CatBoost, etc.) as a benchmark.
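
For instance, a minimal benchmark sketch with scikit-learn's GradientBoostingRegressor (any of the libraries above would work the same way; the synthetic dataset here is just a stand-in for your own data):

```python
# A minimal out-of-the-box boosting benchmark; dataset and settings are placeholders.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

bench = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
print("benchmark MSE:", mean_squared_error(y_test, bench.predict(X_test)))
```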

Tim
  • (+1) This also gives you some idea of the performance you can hope to achieve with your own method. – Frans Rodenburg May 05 '21 at 15:43
  • Thanks so much for your help! Of course, I learned about "boosting" in school but never engaged with it enough to remember it. So I SHOULD know this and I'm a little embarrassed. Still, I have hope that some people reading this will get a good laugh. If so, worth it. – Jdclark May 05 '21 at 15:56
  • @Jdclark people are re-discovering stuff all the time; everyone does that. Sometimes it's hard to notice by yourself what you didn't think about or recognize. – Tim May 05 '21 at 16:04

As was mentioned in the comments, this idea of iteratively learning from previous model errors is at the core of boosting methods like AdaBoost or gradient boosting.
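
For example, in gradient boosting with squared-error loss, each new base learner $h_m$ is fit to the residuals of the current ensemble $F_{m-1}$ and added back with a learning rate $\nu$ (written here only as a rough sketch of the general recipe):

$$
r_i^{(m)} = y_i - F_{m-1}(x_i), \qquad
h_m \approx \arg\min_h \sum_i \big(r_i^{(m)} - h(x_i)\big)^2, \qquad
F_m(x) = F_{m-1}(x) + \nu\, h_m(x).
$$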

As you theorize, the idea is prone to overfitting with certain models like trees, but it actually acts as a regularizer for a model such as linear regression (although I would normally just use standard regularization like L2). In terms of algorithms that do well with this, it's typically trees (xgboost or lightgbm are the go-to hammers in the data science community) or some other approach that partitions your data. This is because each time you refit a tree you get new splits, so the tree can learn new things, whereas refitting a linear regression just updates your coefficients, so you aren't actually adding any complexity. Adding two regression models just averages the coefficients, but adding two tree models gives you a genuinely new, richer model.
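
As a small illustration of that last point (synthetic data and plain scikit-learn models, purely as a sketch): averaging two linear fits collapses back to a single linear model, while averaging two trees produces a new function neither tree represents on its own.

```python
# Averaging two linear models is itself a linear model; averaging two trees is not.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=500)

# Two linear models fit on different halves of the data.
lin1 = LinearRegression().fit(X[:250], y[:250])
lin2 = LinearRegression().fit(X[250:], y[250:])

# Their averaged prediction equals a single linear model with averaged coefficients.
avg_pred = 0.5 * lin1.predict(X) + 0.5 * lin2.predict(X)
avg_coef_pred = X @ (0.5 * (lin1.coef_ + lin2.coef_)) + 0.5 * (lin1.intercept_ + lin2.intercept_)
print(np.allclose(avg_pred, avg_coef_pred))  # True: still just one linear model

# Two trees fit on the same halves: their average is a richer piecewise-constant function.
tree1 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X[:250], y[:250])
tree2 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X[250:], y[250:])
avg_tree_pred = 0.5 * tree1.predict(X) + 0.5 * tree2.predict(X)
print(len(np.unique(tree1.predict(X))),   # at most 8 leaf values
      len(np.unique(tree2.predict(X))),   # at most 8 leaf values
      len(np.unique(avg_tree_pred)))      # typically many more distinct values
```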

This is similar to bagging predictors: bagging linear models converges to the fit on the whole dataset, whereas bagging trees actually benefits you in terms of the bias-variance tradeoff.

In terms of NNs, I believe there is some theory connecting gradient boosting to residual networks and similar architectures; see this question on it.

My recommendation: just use lightgbm or xgboost!

Tylerr
  • Thank you so much! I especially appreciate the comparison between how boosting affects tree vs. regression models. – Jdclark May 05 '21 at 16:09