
I'm reading up on bagging (bootstrap aggregation), and several sources seem to state that the size of the bags (each formed by randomly sampling from our training set with replacement) is typically around 63% of the size of the training set.

My understanding is that if the size of the training set is $N$, and for each bag we draw $N$ samples from the training set with replacement, then each bag will contain about 63% non-repeated samples. But because we drew $N$ samples, shouldn't our bags be of size $N$, with only about $0.63N$ of them being unique?

When we train each model on a bag, do we end up training it on repeated data, or do we discard the repeated data before training?

David
  • Very detailed explanations here: https://stats.stackexchange.com/questions/88980/why-on-average-does-each-bootstrap-sample-contain-roughly-two-thirds-of-observat/513501#513501 – kjetil b halvorsen Mar 20 '21 at 20:47

2 Answers


Yes, a bootstrap sample is in general chosen to be the same size as the training set, say $N$, and on average it contains $N - N/e \approx 0.63N$ unique samples. I think the sources saying it "is typically around 63% of the size of the training set" mean the same thing, just worded differently. A possible rewording might be: a bootstrap sample covers, on average, 63% of the distinct samples in the training set.
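
Not part of the original answer, but a minimal simulation sketch (NumPy, with an illustrative $N$) that checks this figure: the bag has size $N$ while only about 63% of its entries are unique, duplicates included.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000  # illustrative training set size

# One "bag": N draws with replacement from the N training indices.
bag = rng.integers(0, N, size=N)

unique = np.unique(bag).size
print(f"bag size: {bag.size}, unique samples: {unique} ({unique / N:.3f} of N)")
# Prints a unique fraction near 0.632, i.e. 1 - 1/e.
```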

gunes
  • Ah I see. So in the bootstrap sample or the "bags," do we remove the duplicates when we train, or do we keep all duplicates during training? I feel like training on duplicate samples would lead to overfitting. – David Jun 07 '20 at 21:44
  • that's the idea of a bootstrap sample, generally, we keep them @David – gunes Jun 07 '20 at 21:45

In most implementations I've seen, the default is to draw $N$ samples from a pool of $N$ with replacement.

If you draw $N$ samples from a pool of $N$, each sample has probability $(1 - 1/N)^N \approx 1/e$ of never being selected. Therefore the expected number of samples missed is $N/e$, and the expected number of unique samples is $(1 - 1/e)N \approx 0.63N$.
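
As an illustrative sketch (not from the original answer), the limit can be checked numerically:

```python
import math

for N in (10, 100, 1_000, 10_000):
    p_missed = (1 - 1 / N) ** N  # probability a given sample is never drawn
    print(f"N={N}: (1-1/N)^N = {p_missed:.5f}, 1/e = {1 / math.e:.5f}")
# p_missed converges to 1/e ≈ 0.36788, so about 63.2% of samples are unique.
```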

Meir Maor
  • Do you know if the training is performed on the $N$ samples, or only on the unique samples, which is about $0.63N$? In other words, do they remove the duplicates prior to training? – David Jun 07 '20 at 21:45
  • On all samples selected. That gives you a good estimate of the true distribution, with meaningful variability between bags. The weight of each sample varies. – Meir Maor Jun 08 '20 at 04:18
  • I once implemented this by streaming the data in one pass and giving each record a weight sampled from a binomial distribution, which gives the same effect. – Meir Maor Jun 08 '20 at 04:22
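
A minimal sketch of the one-pass weighting idea from that last comment. The function name and streaming setup here are hypothetical, not the commenter's actual code; the underlying fact is that a record's multiplicity in a size-$N$ bootstrap sample is distributed as Binomial$(N, 1/N)$, which tends to Poisson$(1)$ for large $N$.

```python
import numpy as np

rng = np.random.default_rng(0)

def stream_bootstrap_weights(records, n_total):
    """Hypothetical one-pass bootstrap: yield (record, weight) pairs, where
    weight ~ Binomial(n_total, 1/n_total) is the number of times the record
    would appear in a size-n_total bootstrap sample."""
    for record in records:
        weight = rng.binomial(n_total, 1.0 / n_total)
        if weight > 0:  # ~1/e of records get weight 0 and are left out
            yield record, weight

# Usage sketch: pass the weights to any learner that accepts sample weights.
for rec, w in stream_bootstrap_weights(range(10), 10):
    print(rec, w)
```

With independent per-record weights, the total bag size equals $N$ only in expectation rather than exactly, which is the usual trade-off of this streaming variant.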