Basic bootstrapping: 1) why resample rather than use the sample's distribution? 2) assuming what one wants to find out?

Question

In a paper I came across 'non-parametric bootstrapping', which I hadn't heard of. It's used as a way to deal with small sample sizes. But my question is not yet about whether that is appropriate (cf. this question). I'm still trying to understand the basics. I read this, but am still confused!

Example case: Suppose there is a population of worms and I want to know the average length. There are zillions of them, but I can only measure 100 individuals. I do so, and compute the mean. I wonder how much it really tells me about the wider population, so I follow the bootstrapping-with-resampling procedure and get a distribution for the means of the resamples.
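A minimal sketch of that procedure in Python, assuming simulated lengths in place of real measurements (the 50 mm mean, the 8 mm spread, and the helper name `bootstrap_means` are all hypothetical):

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Resample with replacement n_resamples times; return each resample's mean."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# Hypothetical data: 100 simulated worm lengths in mm (not real measurements).
data_rng = random.Random(1)
lengths = [data_rng.gauss(50.0, 8.0) for _ in range(100)]

boot = bootstrap_means(lengths)

sample_sd = statistics.stdev(lengths)        # spread of individual lengths
boot_sd = statistics.stdev(boot)             # spread of the resampled means
std_error = sample_sd / len(lengths) ** 0.5  # textbook standard error of the mean

print(f"SD of lengths:          {sample_sd:.2f}")
print(f"SD of bootstrap means:  {boot_sd:.2f}")
print(f"s / sqrt(n):            {std_error:.2f}")
```

The spread of `boot` describes how much the mean would vary from sample to sample, so it should land near `s / sqrt(n)` and well below the spread of the individual lengths.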

1) Why go through the resampling procedure to produce a distribution-of-means-of-length, rather than 'simply' use the original sample's distribution-of-length, as an indicator for the population length distribution?

2) Suppose the distribution-of-means-of-length turns out roughly Gaussian. So this certainly tells me something about the population, beyond what my sole sample mean tells me: for one thing, that there is some variability in lengths (but as in point (1) above, would not my sample length distribution have told me that too?). Now, AFAIU, this may further tell me something about the shape of the population distribution (as is the intention, right?). But then again, it may not! Perhaps the population length distribution is really bimodal, with different-mean-length morphs in winter and summer, unbeknownst to me; and alas, I only measured in one season. So my sample of n=100 is unimodal and biased, but I don't know it. I don't see how the bootstrapping helps here. Resampling from that biased sample will never say anything about the 'missing mode'. But wasn't the goal of the bootstrapping precisely to let information in my sample enlighten me about the population distribution, or at least about how representative my sample is?

If I don't know whether the winter/summer scenario is the case, nothing about the bootstrapping seems to decide it for me. If I already knew about the winter/summer scenario, I would already know something about how my sample relates to the population, and would not have needed the bootstrapping (as much) to begin with? It's as if you want to know whether your sample is representative of the population; but the procedure circularly assumes that it is.

I must be missing the point... How?

Re (1): how do you propose to deduce the distribution of means-of-length? That's definitely not the same as the distribution of the lengths themselves (as you will find even in a sample of size two!). Re (2): the point in this hypothetical example is to learn about the *mean*, not about the entire population distribution. A good explanation of ideas behind bootstrapping is in Efron & Tibshirani, *An Introduction to the Bootstrap.* — whuber, Feb 10 '17 at 14:09
@whuber Thanks, and sorry for the very late reply: trying to understand this again after a hiatus. — Bryum, Jul 21 '17 at 13:33
Re (1): how? -- precisely by using the bootstrap method? Isn't deducing the distribution of the mean-of-_x_ (or some other statistic of interest) precisely what it does, via the resampling procedure? Now you're right, the latter distribution (call it A) won't be exactly the same as that of the sample's _x_ itself (call that B). But why would A be a better guess than B of the _distribution of x for the whole population_ which is what we're really trying to gauge? Considering that A takes more effort to calculate than B. — Bryum, Jul 21 '17 at 13:34
Re (2): Yes, the point is to learn about the mean... but also to know whether the mean of your sample is any indication of the population mean. For that you would ideally take many samples from the population, but if you cannot, that's where the bootstrap comes in, right? Anyway, rereading the top answer to [the previously linked question](https://stats.stackexchange.com/questions/26088/explaining-to-laypeople-why-bootstrapping-works), I suppose the answer is that indeed the bootstrap only works "to the extent that the original sample is a good one". But this still feels circular to me. — Bryum, Jul 21 '17 at 13:34
@Bryum the concept of a population is elusive. If your study design draws a biased sample, the bootstrap will not fix that for *any* estimate. The only population which the bootstrap reflects is the one obtained by repeating the *exact* experiment again and again with independent draws of data each time. — AdamO, Dec 13 '17 at 17:54
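The point about biased samples in the comments above can be illustrated with a small simulation. This is only a sketch under invented assumptions: an equal mix of winter and summer morphs, with hypothetical mean lengths of 40 mm and 60 mm, and a sample drawn in summer only.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical two-morph population; all numbers are made up for illustration.
winter_mean, summer_mean, sd = 40.0, 60.0, 5.0
population_mean = (winter_mean + summer_mean) / 2  # assumes an equal mix of morphs

# Biased design: all 100 worms are measured in summer.
summer_only = [rng.gauss(summer_mean, sd) for _ in range(100)]

# Bootstrap the mean of the biased sample and sort the resampled means.
boot = sorted(
    statistics.fmean(rng.choices(summer_only, k=len(summer_only)))
    for _ in range(2000)
)

# Approximate 2.5th and 97.5th percentiles of the 2000 sorted means.
lower, upper = boot[49], boot[1949]
covers_truth = lower <= population_mean <= upper

print(f"bootstrap interval: [{lower:.1f}, {upper:.1f}]")
print(f"covers population mean {population_mean}: {covers_truth}")
```

In this setup the interval is narrow and sits near the summer mean, so it misses the overall population mean. Resampling quantifies sampling variability under the design that was actually used; it carries no information about the season that was never sampled.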

score 1 · Answer 1 · edited Dec 13 '17 at 17:50

The nonparametric bootstrap does use the sample distribution. You resample with replacement from the empirical distribution function, which is a discrete probability distribution that assigns a sampling probability of 1/n to each of the n observations in the sample (so a value that occurs k times has probability k/n). You can learn more about this in my book with Robert LaBudde, An Introduction to Bootstrap Methods with Applications to R, published by Wiley in 2011.
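As a rough illustration of that statement, the sketch below compares the empirical probabilities with the frequencies obtained by drawing with replacement. The five data values are hypothetical, chosen so that one value appears twice.

```python
import random
from collections import Counter

# Hypothetical sample of n = 5 observations; the value 4.7 appears twice.
sample = [3.1, 4.7, 4.7, 5.2, 6.0]
n = len(sample)

# Empirical distribution: each observation carries mass 1/n,
# so a value that occurs k times has total mass k/n.
empirical_pmf = {value: count / n for value, count in Counter(sample).items()}

# Drawing with replacement from the sample is drawing from that distribution.
rng = random.Random(0)
draws = rng.choices(sample, k=100_000)
observed_freq = {value: count / len(draws) for value, count in Counter(draws).items()}

for value in sorted(empirical_pmf):
    print(value, empirical_pmf[value], round(observed_freq[value], 3))
```

Each bootstrap resample is just n such draws, which is why the procedure can be described as using the sample's own distribution as a stand-in for the population's.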

answered Feb 10 '17 at 13:58 by Michael R. Chernick · edited Dec 13 '17 at 17:50 by AdamO

What is the justification for downvoting my answer? Maybe it was a mistake. – Michael R. Chernick Feb 11 '17 at 02:55
I did not downvote (yet), but I find this to be more of a comment than an answer. – amoeba Feb 12 '17 at 20:13
@Michael Hi, I did not downvote either. I don't understand how your answer addresses my questions though... I suppose I posted here so I wouldn't have to dig through books! – Bryum Jul 21 '17 at 13:42
@Bryum I addressed number 1 in the title of your question. If you don't know much about the bootstrap and want to learn, it is a deep subject. I mentioned one of my books, but there are many others, as well as many posts on this site. To learn, you need to read or take a course. – Michael R. Chernick Jul 21 '17 at 13:54
