
I have run across the assertion that each bootstrap sample (or bagged tree) will contain on average approximately $2/3$ of the observations.

I understand that the chance of a given observation not being selected in any of $n$ draws, sampling with replacement from $n$ observations, is $(1- 1/n)^n$, which works out to approximately a $1/3$ chance of not being selected.

What is a mathematical explanation for why this formula always gives approximately $1/3$?

gung - Reinstate Monica
xyzzy
  • I believe this is the origin of the $.632$ in the bootstrap 632+ rule. – gung - Reinstate Monica Mar 06 '14 at 02:46
  • How is 1-1 in the numerator not 0? I have an equation for the chance of not being picked in a particular bootstrap sample of size n of ((n-1)/n)**n. But now I have to figure out how to connect it to Euler's number. – mLstudent33 Feb 19 '20 at 03:45

6 Answers


More precisely, each bootstrap sample (or bagged tree) will contain, on average, approximately $1-\frac{1}{e} \approx 0.632$ of the unique observations in the original sample.

Let's go over how the bootstrap works. We have an original sample $x_1, x_2, \ldots x_n$ with $n$ items in it. We draw items with replacement from this original set until we have another set of size $n$.

From that, it follows that the probability of choosing any one item (say, $x_1$) on the first draw is $\frac{1}{n}$. Therefore, the probability of not choosing that item is $1 - \frac{1}{n}$. That's just for the first draw; there are a total of $n$ draws, all of which are independent, so the probability of never choosing this item on any of the draws is $(1-\frac{1}{n})^n$.

Now, let's think about what happens when $n$ gets larger and larger. We can take the limit as $n$ goes towards infinity, using the usual calculus tricks (or Wolfram Alpha): $$ \lim_{n \rightarrow \infty} \big(1-\frac{1}{n}\big)^n = \frac{1}{e} \approx 0.368$$

That's the probability of an item not being chosen. Subtract it from one to find the probability of the item being chosen, which gives you 0.632.
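The convergence is quick; here is a minimal sketch in Python (the thread's only code below is R, but the same one-liner works there) that evaluates $(1-1/n)^n$ for increasing $n$:

```python
import math

# (1 - 1/n)^n approaches 1/e ≈ 0.367879 from below as n grows
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: (1 - 1/n)^n = {(1 - 1 / n) ** n:.6f}")
print(f"limit 1/e       = {1 / math.e:.6f}")
```

Already at $n=100$ the value agrees with $1/e$ to about three decimal places.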

Matt Krause
  • Why does the sign flip for 1/e from the definition of e? Inside the brackets, for e it was (1+1/n)^n and now it is (1-1/n)^n – mLstudent33 Feb 19 '20 at 03:54
  • 1
    @Glen_B nails this in the last line of his answer. Start with $e^x = \lim_{n\to \infty} \left(1 + x/n \right) ^n$ and drop in $x=-1$ to get $\frac{1}{e}$ instead. You'll get $\lim_{n\to \infty} \left(1 + \frac{-1}{n}\right)^n$, which simplifies to $\lim_{n\to \infty} \left(1 - \frac{1}{n}\right)^n$. – Matt Krause Feb 19 '20 at 15:48
  • "Fact 2: When $n$ is large, $\exp(x/n) \approx 1+x/n$. This follows from the series expansion for $e^x$." So the $x$ rather than $1$ in the original equation is the result of some sort of approximation after working out a series. I've only touched a bit on Taylor and Fourier, but is it either one of these that yields that approximation? – mLstudent33 Feb 19 '20 at 23:13
  • Exactly, it's the Taylor series: $e^x = 1 + \frac{x^1}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$, but with $x$ replaced by $\frac{x}{n}$. – Matt Krause Feb 20 '20 at 22:56

Essentially, the issue is to show that $\lim_{n\to\infty}(1- 1/n)^n=e^{-1}$
(and of course, $e^{-1} =1/e \approx 1/3$, at least very roughly).

It doesn't work at very small $n$ -- e.g. at $n=2$, $(1- 1/n)^n=\frac{1}{4}$. It passes $\frac{1}{3}$ at $n=6$, passes $0.35$ at $n=11$, and $0.366$ by $n=99$. Once you go beyond $n=11$, $\frac{1}{e}$ is a better approximation than $\frac{1}{3}$.
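Those threshold values are easy to verify directly; a small Python sketch (the specific $n$ values just reproduce the claims above):

```python
def p_never_chosen(n):
    """Probability a given item is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n

print(p_never_chosen(2))   # exactly 0.25
print(p_never_chosen(6))   # ≈ 0.3349, first n where the value exceeds 1/3
print(p_never_chosen(11))  # ≈ 0.3505, first n where the value exceeds 0.35
print(p_never_chosen(99))  # ≈ 0.3660
```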

[Figure: $(1-1/n)^n$ plotted against $n$, rising toward its limit]

The grey dashed line is at $\frac{1}{3}$; the red and grey line is at $\frac{1}{e}$.

Rather than show a formal derivation (which can easily be found), I'm going to give an outline (that is an intuitive, handwavy argument) of why a (slightly) more general result holds:

$$e^x = \lim_{n\to \infty} \left(1 + x/n \right)^n$$

(Many people take this to be the definition of $\exp(x)$, but you can prove it from simpler results such as defining $e$ as $\lim_{n\to \infty} \left(1 + 1/n \right)^n$.)

Fact 1: $\exp(x/n)^n=\exp(x)$. This follows from basic results about powers and exponentiation.

Fact 2: When $n$ is large, $\exp(x/n) \approx 1+x/n$. This follows from the series expansion for $e^x$.

(I can give fuller arguments for each of these, but I assume you already know them.)

Substitute (2) in (1). Done. (For this to work as a more formal argument would take some work, because you'd have to show that the remaining terms in Fact 2 don't become large enough to cause a problem when taken to the power $n$. But this is intuition rather than formal proof.)

[Alternatively, just take the Taylor series for $\exp(x/n)$ to first order. A second easy approach is to take the binomial expansion of $\left(1 + x/n \right) ^n$ and take the limit term-by-term, showing it gives the terms in the series for $\exp(x/n)$.]

So if $e^x = \lim_{n\to \infty} \left(1 + x/n \right) ^n$, just substitute $x=-1$.

Immediately, we have the result at the top of this answer, $\lim_{n\to\infty}(1- 1/n)^n=e^{-1}$


As gung points out in comments, the result in your question is the origin of the .632+ bootstrap rule

e.g. see

Efron, B. and Tibshirani, R. (1997),
"Improvements on Cross-Validation: The .632+ Bootstrap Method,"
Journal of the American Statistical Association, Vol. 92, No. 438 (Jun.), pp. 548-560

Glen_b

Just adding to @retsreg's answer, this can also be demonstrated quite easily via numerical simulation in R:

N <- 1e7 # number of instances and sample size
bootstrap <- sample(c(1:N), N, replace = TRUE)
round((length(unique(bootstrap))) / N, 3)
## [1] 0.632
vonjd

Sampling with replacement can be modeled as a sequence of binomial trials where "success" is an instance being selected. For an original dataset of $n$ instances, the probability of "success" is $1/n$, and the probability of "failure" is $(n-1)/n$. For a sample size of $b$, the probability of selecting an instance exactly $x$ times is given by the binomial distribution:

$$ P(x,b,n) = \bigl(\frac{1}{n}\bigr)^{x} \bigl(\frac{n-1}{n}\bigr)^{b-x} {b \choose x}$$

In the specific case of a bootstrap sample, the sample size $b$ equals the number of instances $n$. Letting $n$ approach infinity, we get:

$$ \lim_{n \rightarrow \infty} \bigl(\frac{1}{n}\bigr)^{x} \bigl(\frac{n-1}{n}\bigr)^{n-x} {n \choose x} = \frac{1}{ex!}$$

If our original dataset is big, we can use this formula to compute the probability that an instance is selected exactly $x$ times in a bootstrap sample. For $x = 0$, the probability is $1/e$, or roughly $0.368$. The probability of an instance being sampled at least once is thus $1 - 0.368 = 0.632$.
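The limit $\frac{1}{ex!}$ is the Poisson distribution with mean $1$. Here is a quick Python comparison of the exact binomial probabilities against that limit (the choice $n = 10{,}000$ is arbitrary, just "big"):

```python
import math
from math import comb, factorial

def binom_pmf(x, n):
    """P(a given instance is drawn exactly x times in n draws with replacement)."""
    return comb(n, x) * (1 / n) ** x * ((n - 1) / n) ** (n - x)

def poisson1_pmf(x):
    """The limit 1/(e * x!), i.e. the Poisson(1) mass function."""
    return 1 / (math.e * factorial(x))

n = 10_000
for x in range(4):
    print(f"x = {x}: exact = {binom_pmf(x, n):.5f}, limit = {poisson1_pmf(x):.5f}")
```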

Needless to say, I painstakingly derived this using pen and paper, and did not even consider using Wolfram Alpha.

retsreg

If you want to look deeper into the sample coverage of the bootstrap, it is worth noting that simple-random-sampling with replacement gives an "occupancy number" that follows the classical occupancy distribution (see e.g., O'Neill 2019). Suppose we have an original sample containing $n$ data points and we take a bootstrap resample, also with $n$ points. Let $K_n$ denote the number of data points in the original sample that appear in the resample. It is well-known that this quantity follows the classical occupancy distribution, with mass function:

$$\mathbb{P}(K_n=k) = \text{Occ}(k|n,n) = \frac{(n)_k \cdot S(n,k)}{n^n}.$$

(The values $(n)_k = \prod_{i=1}^k (n-i+1)$ are the falling factorials and the values $S(n,k)$ are the Stirling numbers of the second kind.) The mean and variance of this occupancy number are:

$$\begin{align} \mathbb{E}(K_n) &= n \bigg[ 1 - \bigg( 1-\frac{1}{n} \bigg)^n \bigg], \\[6pt] \mathbb{V}(K_n) &= n \bigg[ (n-1) \bigg(1-\frac{2}{n} \bigg)^n + \bigg(1-\frac{1}{n} \bigg)^n - n \bigg(1-\frac{1}{n} \bigg)^{2n} \bigg]. \\[6pt] \end{align}$$

Taking $n \rightarrow \infty$ we get the asymptotic equivalence:

$$\mathbb{E}(K_n) \sim n \bigg( 1-\frac{1}{e} \bigg) \quad \quad \quad \mathbb{V}(K_n) \sim \frac{n}{e} \bigg( 1-\frac{1}{e} \bigg).$$

Consequently, as $n \rightarrow \infty$ the proportion of the original data points that are covered by the resample converges to $K_n/n \rightarrow 1-1/e \approx 0.632$. Whilst this is a slightly more complicated presentation of the issue, examination of the classical occupancy distribution allows you to fully describe the stochastic behaviour of the sample coverage.
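The exact mean formula above is easy to check against simulation; a Python sketch (the sample size $n=200$, replication count, and seed are arbitrary choices):

```python
import math
import random

def simulated_mean_coverage(n, reps=2000, seed=1):
    """Simulate E(K_n): average count of distinct original points in a resample."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        # one bootstrap resample; the set keeps only distinct indices
        total += len({rng.randrange(n) for _ in range(n)})
    return total / reps

n = 200
exact = n * (1 - (1 - 1 / n) ** n)   # E(K_n) from the formula above
limit = n * (1 - 1 / math.e)         # asymptotic value, ~0.632 * n
print(simulated_mean_coverage(n), exact, limit)
```

All three numbers agree closely even at this modest $n$.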

Ben

This can be easily seen by counting. How many total possible (ordered) samples? $n^n$. How many NOT containing a specific value? $(n-1)^n$. The probability of a sample not containing a specific value is therefore $(n-1)^n/n^n = (1-1/n)^n$, which is about $1/3$ (more precisely, $1/e$) in the limit.
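A one-line check of this counting argument in Python, with $n = 5$ as an arbitrary small example:

```python
# Direct count for n = 5: ordered samples avoiding one particular value.
n = 5
total = n ** n          # n^n possible ordered samples
without = (n - 1) ** n  # (n-1)^n ordered samples that never pick the given value
print(without / total)  # prints 0.32768, i.e. (1 - 1/5)^5
```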

Maxim Khesin