7

Can somebody give me a non-mathematical intuition for why bootstrap aggregating (bagging) reduces overfitting?

From my point of view, we are not providing any additional information; we are not really increasing the number of observations.

Firebug
  • 15,262
  • 5
  • 60
  • 127
J3lackkyy
  • 535
  • 1
  • 9
  • 1
    What do you mean by reducing overfitting? Bootstrap is used to create confidence intervals. What would you be bootstrapping, and for which parameter would you be calculating a confidence interval? – Dave Jul 26 '21 at 19:13
  • For instance, we have a decision tree. And one decision tree tends to overfit. Therefore, we build bootstrap samples to grow multiple decision trees and then we produce a decision as a consens of the grown trees. – J3lackkyy Jul 26 '21 at 19:16
  • 6
    Each tree overfits, but each tree overfits in a different way. When you average, the parts of the trees that capture true signal tend to reinforce, but the overfit parts tend to average each other away to zero. – Matthew Drury Jul 26 '21 at 19:17
  • Thank you very much! – J3lackkyy Jul 26 '21 at 19:31
  • @MatthewDrury could you expand it into an answer? – Tim Jul 27 '21 at 13:16
  • 1
    I'm pretty sure the OP meant "Bootstrap aggregating", so not just any kind of ensembling (e.g. traditional boosting) applies. – Firebug Oct 05 '21 at 14:51

2 Answers

4

This is the phenomenon whereby an ensemble decision over so-called weak learners (see) yields good performance. The reason for this is explained by Dietterich here:

Uncorrelated errors made by the individual classifiers can be removed by voting.

Further explanation, or a theoretical justification of this statement, could be an open research problem.
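
As a rough numerical illustration of the quoted statement (just a sketch with made-up numbers, not from Dietterich's paper): suppose each of 25 weak classifiers is correct on a given example with probability 0.6, independently of the others. A majority vote is then right far more often than any single classifier.

```python
import numpy as np

# Sketch of "uncorrelated errors removed by voting".
# Illustrative assumptions (not from the answer): 25 weak classifiers,
# each independently correct on any given example with probability 0.6.
rng = np.random.default_rng(0)

n_examples, n_learners, p_correct = 100_000, 25, 0.6

# correct[i, j] is True if learner j classifies example i correctly
correct = rng.random((n_examples, n_learners)) < p_correct

individual_acc = correct.mean()                               # ~0.60
majority_acc = (correct.sum(axis=1) > n_learners / 2).mean()  # ~0.85

print(f"single weak learner accuracy: {individual_acc:.3f}")
print(f"majority-vote accuracy:       {majority_acc:.3f}")
```

If the learners' errors were strongly correlated instead of independent, the vote would cancel far less of them; decorrelating the learners is exactly what the bootstrap resampling in bagging is for.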

msuzen
  • 1,709
  • 6
  • 27
  • Weak learners have low variance, high bias. That way they consistently beat chance performance, but not much more than that. So that their aggregated decision through boosting results in a strong learner. Boosting is not immune to overfitting. Bagging, on the other hand, starts with high variance, low bias learners, which we could call "overfitted". Then, ensembling by bagging controls their variance, or "reduces overfitting" in a way. – Firebug Oct 05 '21 at 14:48
0

To illustrate why averaging reduces the standard deviation and makes the prediction more accurate, I'll give an example.

Let's suppose that we have two models whose predictions are random variables $X_1 \sim N(\mu, \sigma^2)$ and $X_2 \sim N(\mu, \sigma^2)$, i.e. each prediction is the true mean value plus an error term with standard deviation $\sigma$.

Assuming the errors are uncorrelated, the average is:

$\frac{X_1 + X_2}{2}$, which is also normally distributed, with mean $\mu$ and variance $\frac{2\sigma^2}{4} = \frac{\sigma^2}{2}$, i.e. a standard deviation of $\frac{\sigma}{\sqrt{2}}$.

That is, we preserve the same mean while reducing the standard deviation.

Having said that, in reality the errors have some correlation. We can still achieve variance reduction, but it has limits: the variance cannot be driven towards zero simply by increasing the ensemble size.
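
As a quick numerical check of the above (a sketch; $\mu$, $\sigma$, the correlation $\rho$ and the ensemble size are made-up values): averaging two uncorrelated predictions gives a standard deviation near $\sigma/\sqrt{2}$, while with pairwise-correlated errors the variance of the ensemble average levels off around $\rho\sigma^2$ no matter how many models are added.

```python
import numpy as np

# Sketch: variance reduction from averaging predictions.
# mu, sigma, rho and the ensemble size below are illustrative choices.
rng = np.random.default_rng(0)

mu, sigma, n_draws = 0.0, 1.0, 50_000

# Two models with uncorrelated errors: SD of the average ~ sigma / sqrt(2).
x1 = rng.normal(mu, sigma, n_draws)
x2 = rng.normal(mu, sigma, n_draws)
print("uncorrelated pair  :", np.std((x1 + x2) / 2))        # ~0.707

# Many models whose errors share pairwise correlation rho: the variance of
# the average approaches rho * sigma^2, so it cannot be driven to zero.
rho, n_models = 0.3, 200
cov = np.full((n_models, n_models), rho * sigma**2)
np.fill_diagonal(cov, sigma**2)
preds = rng.multivariate_normal(np.full(n_models, mu), cov, size=n_draws)
print("correlated ensemble:", np.std(preds.mean(axis=1)))   # ~sqrt(rho) ≈ 0.55
```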

ofer-a
  • 1,008
  • 5
  • 9