
I've (approximately) heard that:

bagging is a technique to reduce the variance of a predictor/estimator/learning algorithm.

However, I have never seen a formal mathematical proof of this statement. Does anyone know why this is mathematically true? It just seems to be such a widely accepted/known fact that I'd expect a direct reference for it; I'd be surprised if there is none. Also, does anyone know what effect bagging has on the bias?

Are there any other theoretical guarantees for bagging that anyone knows of, considers important, and would like to share?

Charlie Parker

1 Answer


The main use case for bagging is reducing the variance of low-bias models by aggregating them. This was studied empirically in the landmark paper "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants" by Bauer and Kohavi. It usually works as advertised.
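
For intuition, here is a minimal simulation sketch (my own toy setup, not from Bauer and Kohavi): refit a single deep regression tree and a bagged ensemble of such trees on many independently drawn training sets, then compare the variance of their predictions at fixed test points.

```python
# Minimal sketch: variance of a single deep tree vs. a bagged ensemble,
# estimated by refitting on many independent training sets.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
x_test = np.linspace(0, 1, 50).reshape(-1, 1)

def draw_data(n=200):
    X = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

preds_single, preds_bagged = [], []
for _ in range(100):  # 100 independent training sets
    X, y = draw_data()
    tree = DecisionTreeRegressor().fit(X, y)  # low bias, high variance
    bag = BaggingRegressor(DecisionTreeRegressor(),
                           n_estimators=50, random_state=0).fit(X, y)
    preds_single.append(tree.predict(x_test))
    preds_bagged.append(bag.predict(x_test))

# Prediction variance at each test point, averaged over the test grid.
print("single tree variance: ", np.var(preds_single, axis=0).mean())
print("bagged trees variance:", np.var(preds_bagged, axis=0).mean())
```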

However, contrary to popular belief, bagging is not guaranteed to reduce the variance. A more recent and (in my opinion) better explanation is that bagging reduces the influence of leverage points. Leverage points are those that disproportionately affect the resulting model, such as outliers in least-squares regression. It is rare but possible for leverage points to positively influence resulting models, in which case bagging reduces performance. Have a look at "Bagging equalizes influence" by Grandvalet.
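
As a rough illustration of the influence idea (a toy sketch of my own, not Grandvalet's experiments), one can add a single high-leverage outlier to a least-squares problem and compare how far the plain OLS fit and a bagged OLS fit are dragged by it; note that the outlier is simply absent from roughly 37% of the bootstrap resamples.

```python
# Toy sketch: effect of one high-leverage outlier on OLS vs. bagged OLS.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(1)
X = rng.uniform(0, 1, size=(50, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

# Add one high-leverage point far outside the bulk of the data.
X_out = np.vstack([X, [[5.0]]])
y_out = np.append(y, -10.0)

ols_clean = LinearRegression().fit(X, y)
ols_out = LinearRegression().fit(X_out, y_out)
bag_out = BaggingRegressor(LinearRegression(), n_estimators=500,
                           random_state=1).fit(X_out, y_out)

# Slope of the bagged fit, approximated by a finite difference.
slope_bag = (bag_out.predict([[1.0]]) - bag_out.predict([[0.0]]))[0]
print("OLS slope, clean data:    ", ols_clean.coef_[0])
print("OLS slope, with outlier:  ", ols_out.coef_[0])
print("bagged OLS slope, outlier:", slope_bag)
```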

So, to finally answer your question: the effect of bagging largely depends on leverage points. Few theoretical guarantees exist, except that computation time grows linearly with the number of bags! That said, it is still a widely used and very powerful technique. When learning with label noise, for instance, bagging can produce more robust classifiers.
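
To see the label-noise point in practice, here is a small sketch (my own example, not taken from a specific paper): flip a fraction of the training labels and compare a single decision tree with a bagged ensemble on a clean test set.

```python
# Sketch: robustness to label noise of a single tree vs. a bagged ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

rng = np.random.RandomState(2)
X, y = make_classification(n_samples=2000, n_features=20, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=2)

# Flip 20% of the training labels to simulate label noise.
flip = rng.rand(len(y_tr)) < 0.2
y_noisy = np.where(flip, 1 - y_tr, y_tr)

tree = DecisionTreeClassifier(random_state=2).fit(X_tr, y_noisy)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        random_state=2).fit(X_tr, y_noisy)

print("single tree accuracy: ", tree.score(X_te, y_te))
print("bagged trees accuracy:", bag.score(X_te, y_te))
```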

Rao and Tibshirani have given a Bayesian interpretation in "The out-of-bootstrap method for model averaging and selection":

In this sense, the bootstrap distribution represents an (approximate) nonparametric, non-informative posterior distribution for our parameter. But this bootstrap distribution is obtained painlessly - without having to formally specify a prior and without having to sample from the posterior distribution. Hence we might think of the bootstrap distribution as a "poor man's" Bayes posterior.
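
As a toy illustration of this "poor man's posterior" idea (my own sketch, not from Rao and Tibshirani), the bootstrap distribution of a sample mean is close to the approximate flat-prior posterior Normal(xbar, s^2/n):

```python
# Sketch: bootstrap distribution of the mean vs. approximate flat-prior posterior.
import numpy as np

rng = np.random.RandomState(3)
x = rng.normal(loc=1.0, scale=2.0, size=100)  # observed sample

# Bootstrap distribution of the sample mean.
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(5000)])

# Large-sample posterior under a flat prior: Normal(xbar, s^2 / n).
print("bootstrap: mean %.3f, sd %.3f" % (boot_means.mean(), boot_means.std()))
print("posterior: mean %.3f, sd %.3f" % (x.mean(), x.std(ddof=1) / np.sqrt(len(x))))
```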

Marc Claesen
    How does the 'leverage points' explanation apply to trees, which are often recommended for bagging? While it's clear what high leverage points are for linear regression, what are these points for trees? – DavidR Mar 17 '15 at 20:18
  • Found another reference related to this question: http://www.quora.com/Are-there-any-theoretical-guarantees-or-justifications-for-bagging-methods-in-machine-learning What do you think? Does this contradict your point that bagging isn't theoretically guaranteed to reduce the variance? – Charlie Parker Mar 21 '15 at 18:46
  • I saw that Wikipedia says that bagging (aka bootstrap aggregation) lowers variance. If there is no theoretical evidence for this, does that mean the article is wrong? – Charlie Parker Feb 16 '16 at 21:16
  • In most cases, bagging does lower variance, but that's not its actual mechanism. Grandvalet has shown examples where it increases variance, and illustrated that the mechanism is more closely related to equalizing the influence of data points that strongly affect the model, such as outliers in least-squares regression, which in most cases reduces variance. – Marc Claesen Feb 17 '16 at 07:38