I use bootstrapping to generate the distribution / histogram of my sample statistic and find out that the value of my real sample statistic is way out in the tail. What does this mean?

Does it mean that my sample is unrepresentative of the population? But if it is indeed unrepresentative of the population, then wouldn't bootstrapping give the incorrect sampling distribution in the first place?

This is a "what if" question of mine, so if it is somehow impossible for my real sample statistic to be way out in the tail, I'd appreciate an explanation about why as well.

Heisenberg
  • There's not enough information here to tell what you did, and so no basis on which to interpret the sample value being in the tail. – Glen_b Aug 16 '14 at 00:08
  • What kind of information would you like to know? I could give a more concrete example. Say, as a data analyst I'm handed a survey sample, which was supposed to be a simple random sample, but I have no other information about how well it was implemented. I then use bootstrapping to examine the variance of the sample mean. It turns out that my (real) sample mean is in the tail of the bootstrap distribution. What should I make of this? – Heisenberg Aug 16 '14 at 00:55
  • The problem is knowing precisely what you did when you bootstrapped. What did your bootstrap involve doing? – Glen_b Aug 16 '14 at 12:58
  • By bootstrapping, I mean repeatedly drawing random samples with replacement from my real sample, n times in total, each resample the same size as my real sample. For each resample I calculate the sample mean. Then, with the n "new" sample means plus the 1 original sample mean, I plot the distribution. (It is my understanding that this is what bootstrapping means -- are there different ways to do bootstrapping?) – Heisenberg Aug 16 '14 at 13:54
  • There are *many, many* things people do that get called bootstrapping, and in some of those - even where correctly implemented - it's neither surprising nor due to random chance. In some of the more complex cases it might simply indicate inappropriate model choice, for example. It becomes important to try to tease apart issues like "complex but badly wrong model" from "random chance" from "incorrectly implemented bootstrap" from "something nearer to a randomization test simply being called a bootstrap" and so on. Your original information gave no basis on which to tell. – Glen_b Aug 16 '14 at 23:28

1 Answer

This could be explained by the presence of outliers, especially if the estimator is sensitive to them (linear regression or a simple mean, for instance).

Suppose we want the mean of a real-valued variable $X$, but the sample contains a few outliers. The sample mean then reflects every outlier, whereas many bootstrap resamples will omit some of them, because each resample leaves out on average a fraction of about $e^{-1} \approx 0.37$ of the observations.
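To see where the original sample mean falls within the bootstrap distribution on your own data, a check along these lines may help. It is only a minimal sketch using the Python standard library: the data set (48 roughly normal points plus two large outliers), the seeds, and the helper names `bootstrap_means` and `percentile_rank` are all hypothetical choices made for illustration.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Return the mean of each of n_resamples bootstrap resamples of data."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample draws n observations from data with replacement.
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_rank(values, x):
    """Fraction of values that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)


# Hypothetical sample: 48 well-behaved points plus 2 large outliers.
data_rng = random.Random(1)
data = [data_rng.gauss(0, 1) for _ in range(48)] + [50.0, 60.0]

sample_mean = statistics.fmean(data)
means = bootstrap_means(data)
rank = percentile_rank(means, sample_mean)

print(f"sample mean                    = {sample_mean:.3f}")
print(f"average of bootstrap means     = {statistics.fmean(means):.3f}")
print(f"percentile rank of sample mean = {rank:.3f}")
```

A rank close to 0 or 1 means the original statistic sits in a tail of the bootstrap distribution. Note that for a plain mean the resampled means average out to the original sample mean, so plotting a histogram of `means` is a useful complement: with a few large outliers it tends to look lumpy, with one cluster per number of outliers drawn.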

jubo
  • I think your main point is valuable, but I don't see why you assume that each discards 1/e of the data. – rolando2 Aug 16 '14 at 12:39
  • For each datapoint, the probability of not being chosen in a given bootstrap sample is $P(A)= (1 - \frac{1}{N})^N$, where $N$ is the number of observations. When $N$ tends to infinity, this probability converges to $e^{-1}$. – jubo Aug 16 '14 at 16:54
  • You're welcome. I find it pretty fascinating, it's a nice way to come back to $e$. – jubo Aug 17 '14 at 18:18
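
The limit mentioned in the comments above, $(1 - \frac{1}{N})^N \to e^{-1}$, can be checked numerically. The following is a rough sketch, not part of the original answer; the function names, the number of resamples, and the seed are assumptions made for illustration. It compares the closed-form probability with the simulated fraction of observations that never appear in a resample.

```python
import math
import random


def prob_left_out(n):
    """Probability that a given observation is absent from a bootstrap resample of size n."""
    return (1 - 1 / n) ** n


def simulated_fraction_left_out(n, n_resamples=500, seed=0):
    """Average fraction of the n observations that never appear in a resample."""
    rng = random.Random(seed)
    missing = 0
    for _ in range(n_resamples):
        # Indices drawn with replacement; the set keeps only the distinct ones.
        drawn = {rng.randrange(n) for _ in range(n)}
        missing += n - len(drawn)
    return missing / (n * n_resamples)


for n in (10, 100, 1000):
    print(n, round(prob_left_out(n), 4), round(simulated_fraction_left_out(n), 4))
print("1/e =", round(math.exp(-1), 4))
```

The closed-form and simulated columns should approach each other and the value of $e^{-1}$ as $N$ grows.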