
This question is derived from this one about the ".632 Rule." I am writing with particular reference to user603's answer and notation, to the extent that they simplify matters.

That answer begins with a sample of size $n,$ drawn with replacement, from a collection (call it N) of $n$ distinct items. The probability that the $i^{th}$ sample $s_i$ differs from a particular element $m$ of N is then $(1 - 1/n).$

In that answer all elements of N have an equal chance of being randomly drawn.

My question is this: suppose instead that the items to be drawn are normally distributed. That is, we subdivide the standard normal curve from $Z = -4$ to $Z = 4$ into (say) 100 equal-length subintervals. Each of the 100 items in N has a probability of being drawn equal to the area under the curve over its respective subinterval.
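For concreteness, here is a minimal sketch of that construction in R (the endpoints $\pm 4$ and the 100 subintervals are the choices made above; the final call to sample merely illustrates drawing with these probabilities):

breaks <- seq(-4, 4, length.out = 101)  # endpoints of 100 equal-length subintervals
Fi <- diff(pnorm(breaks))               # area under the standard normal curve in each
Fi <- Fi / sum(Fi)                      # normalize so the probabilities sum to 1
s <- sample(1:100, 100, replace = TRUE, prob = Fi)  # one sample of size n = 100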

My thinking was as follows:

The reasoning is similar, I think, to that in the linked answer. The probability that $s_i \ne m$, with $m$ an element of N, is $P(s_i \neq m) = 1 - F_i,$ in which $F_i$ is the probability of drawing $s_i.$

The probability that a particular element m is in the sample S of size n is

$$P(m \in S) = 1 - P(m \notin S) = 1 - \prod_{i=1}^n P(s_i \neq m) = 1 - \prod_{i=1}^n\left(1 - F_i\right).$$

A calculation seems to show that, as the length of the subintervals gets small, the answer converges to the same number as in the first case (in which the probabilities of the $s_i$ are all equal).
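Here is a sketch of that calculation in R; the helper p.in.sample and the particular subdivision counts are mine, chosen only for illustration:

p.in.sample <- function(n) {            # evaluates 1 - prod(1 - F_i) for n slices
  Fi <- diff(pnorm(seq(-4, 4, length.out = n + 1)))
  Fi <- Fi / sum(Fi)                    # slice probabilities summing to 1
  1 - prod(1 - Fi)
}
sapply(c(100, 400, 1600, 6400), p.in.sample)  # tends to 1 - exp(-1) = 0.6321206...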

This seems counterintuitive (to me) because the construction seems to throw in elements of N which are rare, so I would expect a number smaller than .632.

Also, if this is correct, I guess we would have

$$ \lim_{n \to \infty} \prod_{i=1}^n\left(1 - F_i\right) = \lim_{n \to \infty}\left(1 - \frac{1}{n}\right)^n = \frac{1}{e}, $$

which I don't know to be true or false yet.
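The right-hand equality, at least, is the classical limit defining $e^{-1}$ and is easy to check numerically:

n <- 10^(1:6)
data.frame(n = n, approx = (1 - 1/n)^n, limit = exp(-1))  # approx tends to 0.3678794...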

Edit: If it's true it would probably generalize some.

Thanks for any insights.

daniel
  • I just asked about the last equation on Mathematics SE (question 791114) because I am also interested in how it generalizes, if at all. – daniel May 12 '14 at 02:01
  • ...and the short answer is that the last equality is correct for well-behaved PDFs, so the answer to the question is that the .632 rule holds for a wide variety of underlying distributions. – daniel May 13 '14 at 21:05
  • Can I lift someone else's answer from another site and post it here as mine? That's why I posted the brief comment. Maybe there's an accepted way of doing this, if so I am amenable. – daniel Jul 20 '16 at 15:49
  • of course you can, just mention the source at some point :) – Firebug Jul 20 '16 at 16:15
  • @Firebug: can you point to an instance where this is done so I can see what you mean? Thanks. – daniel Jul 20 '16 at 16:17
  • http://stackoverflow.com/help/referencing – Firebug Jul 20 '16 at 16:26

1 Answer


The question asks about the limiting behavior of

$$1 - \prod_{i=1}^n\left(1 - F_i\right)\tag{1}$$

as $n$ grows and the $F_i$ uniformly shrink in such a way that (a) all are non-negative and (b) they sum to unity. (These follow from the construction of the $F_i$ and the axioms of probability.)

By definition, this product is the exponential of its logarithm:

$$\prod_{i=1}^n(1 - F_i) = \exp\left(\sum_{i=1}^n\log\left(1-F_i\right)\right).$$

Taylor's Theorem (with the Lagrange form of the remainder), applied to $\log$, establishes that

$$\log\left(1-F_i\right) = -F_i - \frac{F_i^2}{2\left(1-\phi_i\right)^2}$$

for some $\phi_i$ in the interval $[0, F_i]$. In other words, these logarithms equal $-F_i$ up to a non-negative remainder proportional to $F_i^2$. When $n$ is large enough to assure that all the $F_i$ are smaller than some given $\epsilon$ with $0 \lt \epsilon \le 1 - 1/\sqrt{2}$ (a condition assured by the uniform shrinkage of the $F_i$), each remainder is at most $F_i^2$, because then $2\left(1-\phi_i\right)^2 \ge 2\left(1-\epsilon\right)^2 \ge 1$. Moreover, (a) and (b) imply

$$\sum_{i=1}^n F_i^2 \le \max_i\left(F_i\right)\sum_{i=1}^n F_i \lt \epsilon \cdot 1 = \epsilon.$$

Consequently

$$-1 = -\sum_{i=1}^n F_i \ge \sum_{i=1}^n\log\left(1-F_i\right) \ge -\sum_{i=1}^n F_i - \sum_{i=1}^n F_i^2 \gt -1 - \epsilon,$$

which, because $\epsilon$ can be taken arbitrarily small, squeezes the logarithm of the product to the limit $-1$. Since $\exp$ is continuous, the product $\prod_{i=1}^n(1 - F_i)$ converges to the exponential of this limit, $\exp(-1)$. Consequently

$$\lim_{n\to\infty} \left(1 - \prod_{i=1}^n(1 - F_i)\right) = 1 - \exp(-1) \approx 0.632,$$

QED.
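To see the squeeze numerically, one can evaluate the sum and both bounds for the Normal slices of the question (400 slices, matching the error analysis below):

Fi <- diff(pnorm(seq(-4, 4, length.out = 401)))
Fi <- Fi / sum(Fi)                      # the F_i: non-negative, summing to unity
c(upper = -sum(Fi),                     # equals -1 exactly, by (b)
  logprod = sum(log(1 - Fi)),           # the squeezed sum: about -1.0028
  lower = -sum(Fi) - sum(Fi^2))         # about -1.0056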


A closer look at this analysis establishes that the error in this approximation (which will always be a lower bound) is no greater in size than $$\left(\exp\left((n/2)\max(F_i^2)\right) - 1\right)\exp(-1).$$ For instance, the division of a standard Normal distribution into $n=400$ slices between $-4$ and $4$ produces a maximum $F_i$ near the mode $0$, where it approximately equals the area of a rectangle there, $1/(50\sqrt{2\pi}) \approx 0.0080$. The foregoing bound establishes that the value of formula $(1)$ will be within $0.0048$ of its limiting value. The actual error is smaller still, $0.001041$. Here's the calculation in R (which we can trust because none of the $f_i$ is so small relative to $1$ that computing $1-f_i$ loses precision):

f <- diff(pnorm(seq(-4, 4, length.out=401))) # The normal "slices".
f <- f / sum(f)                              # Make them sum to unity.
exp(-1) - prod(1 - f)                        # Compute the error.

Indeed, 1 - prod(1-f) is $0.6331615\ldots$ whereas $1-\exp(-1)$ is $0.6321206\ldots$.
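For comparison, a sketch evaluating the error bound itself, reusing f from the snippet above:

n <- length(f)                          # 400 slices
(exp((n/2) * max(f)^2) - 1) * exp(-1)   # the bound: about 0.0047, above the actual 0.001041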

whuber