8

I am attempting to run my panel regression on a smaller sample, because it is a very large regression (fixed effects, Heckman selection) and it takes 4 hours every time it runs.

I want to estimate robust standard errors for my model. However, I would rather not work with re-samples as large as the original dataset.

  1. If I run a bootstrap on panel data, do I re-sample the rows or the individuals?
  2. Can I use smaller re-samples in the bootstrap? In the textbook treatment, bootstrapping a sample of size N always draws re-samples of size N.
Glorfindel

2 Answers

4

To your first question, yes: re-sample the individuals. This is called block bootstrapping. Any time you think you have dependence in your data, you should bootstrap groups of observations together so that the dependence is captured; the units you re-sample over should be independent of one another.
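A minimal sketch of that resampling scheme in plain Python (the `ids` list and the helper name are illustrative, not from the question):

```python
import random
from collections import defaultdict

def block_bootstrap_indices(ids, rng):
    """Re-sample whole individuals (clusters) with replacement.

    ids: sequence giving the individual that each row belongs to.
    Returns row indices of the re-sampled panel: each draw pulls in
    every row of the drawn individual, so the dependence structure
    within an individual is preserved.
    """
    rows_by_id = defaultdict(list)
    for row, ind in enumerate(ids):
        rows_by_id[ind].append(row)
    individuals = sorted(rows_by_id)
    drawn = rng.choices(individuals, k=len(individuals))
    return [row for ind in drawn for row in rows_by_id[ind]]

ids = [1, 1, 1, 2, 2, 3, 3, 3, 3]   # unbalanced panel with 3 individuals
idx = block_bootstrap_indices(ids, random.Random(0))
```

Each bootstrap replication would then fit the model on the rows in `idx` and record the coefficient estimates; the standard deviation of those estimates across replications is the bootstrap standard error.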

To your second question, the answer is also yes. You can make a sample half as big if you want. This won't give you correct standard errors of course. It will give you the standard errors correct for a sample half as big. Perhaps, in your application, you can show analytically that the standard errors are proportional to $1/\sqrt{N}$. In that case, you could bootstrap a sample a quarter as big, get the standard error you care about, and then multiply it by a factor of $1/2$.
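For the sample mean, that $1/\sqrt{N}$ scaling is easy to check numerically. A rough sketch (sample sizes and names here are illustrative):

```python
import random
import statistics

rng = random.Random(0)

def boot_se_of_mean(n, reps=400):
    # Draw one sample of size n, then bootstrap the SE of its mean.
    sample = [rng.gauss(0, 1) for _ in range(n)]
    means = [statistics.mean(rng.choices(sample, k=n)) for _ in range(reps)]
    return statistics.stdev(means)

se_full = boot_se_of_mean(2000)
se_quarter = boot_se_of_mean(500)
# If the SE is proportional to 1/sqrt(N), this ratio should be near
# sqrt(500/2000) = 0.5: multiplying the quarter-sample SE by 1/2
# approximates the full-sample SE.
ratio = se_full / se_quarter
```

With enough replications the ratio settles near 0.5, which is exactly the rescaling argument above.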

Finally, four hours isn't that long. Once you get to the exact model you want, a 100-replication bootstrap is only going to take 400 hours. That's 400/24 ≈ 17 days. What's the problem with that? It's less than a month. And cutting the sample to a quarter of its size would reduce that to roughly 4 days, assuming run time scales roughly linearly with sample size.

Also, are you taking advantage of parallel processing? I don't know how you are running your analysis or how you plan to bootstrap, but bootstrapping is about the most parallelizable computation there is: the replications are independent of one another. With enough processors (100), you could do the whole bootstrap in 4 hours. That is entirely plausible if you have access to a high-performance computing cluster. Even without one, you can probably speed things up by a factor of four just by using your desktop computer properly: it likely has multiple cores, each of which may be able to run more than one thread at a time.
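As a sketch of how simple this can be with just the standard library (the sample mean stands in for the expensive regression; threads are used only to keep the example portable, since a CPU-bound bootstrap would use `ProcessPoolExecutor` or one cluster job per replication):

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def one_replication(data, seed):
    # One bootstrap replication: re-sample rows with replacement and
    # re-estimate.  The sample mean stands in for the real regression.
    rng = random.Random(seed)
    return statistics.mean(rng.choices(data, k=len(data)))

base_rng = random.Random(42)
data = [base_rng.gauss(0, 1) for _ in range(500)]

# Replications are independent, so they can be farmed out freely.
with ThreadPoolExecutor(max_workers=8) as pool:
    estimates = list(pool.map(partial(one_replication, data), range(100)))

boot_se = statistics.stdev(estimates)  # bootstrap standard error
```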

Bill
  • Hi Bill, thank you for your reply. I might just use a smaller sample; I don't have 17 days before my research assignment is due. I did think about parallelization, but I would probably be blacklisted from the supercomputing cluster: each running instance of the regression takes 128 GB. – James JianYong Song Mar 26 '15 at 05:09
1

This was asked a long time ago, but I wrote an answer for a very similar question (maybe these should be linked?) and will post it here as well in case anyone discovers this question in the future.

For your first question, @Bill is right -- you should "block bootstrap" the individuals to ensure the dependence structures within each individual's data are respected.

For your second question, in short, the answer is yes: you can do this in many settings, but you should correct for the sample size, since the estimator you are computing is actually a different one (e.g., for the sample mean, $\frac{1}{N}\sum_{i=1}^N X_i$ is a different estimator than $\frac{1}{M}\sum_{i=1}^M X_i$ if $M \ne N$). This approach is usually called the $M$ out of $N$ bootstrap, and it works (in the sense of being consistent) in most settings where the "traditional" bootstrap works, as well as in some settings where it doesn't.

The reason why is that many bootstrap consistency arguments study statistics of the form $\sqrt{N} (T_N - \mu)$, where $T_N$ is a statistic of the random variables $X_1, \ldots, X_N$ and $\mu$ is some parameter of the underlying distribution. For example, for the sample mean, $T_N = \frac{1}{N} \sum_{i=1}^N X_i$ and $\mu = \mathbb{E}(X_1)$.

Many bootstrap consistency proofs argue that, as $N \to \infty$, given some finite sample $\{x_1, \ldots, x_N\}$ and associated point estimate $\hat{\mu}_N = T_N(x_1, \ldots, x_N)$, $$ \sqrt{N}(T_N(X_1^*, \ldots, X_N^*) - \hat{\mu}_N) \overset{D}{\to} \sqrt{N}(T_N(X_1, \ldots, X_N) - \mu) \tag{1} \label{convergence} $$ where the $X_i$ are drawn from the true underlying distribution and the $X_i^*$ are drawn with replacement from $\{x_1, \ldots, x_N\}$.

However, we could also use shorter re-samples of length $M < N$ and consider the statistic $$ \sqrt{M}(T_M(X_1^*, \ldots, X_M^*) - \hat{\mu}_N). \tag{2} \label{m_out_of_n} $$ It turns out that, as $M, N \to \infty$, the statistic (\ref{m_out_of_n}) has the same limiting distribution as (\ref{convergence}) in most settings where (\ref{convergence}) holds, and in some where it does not. Because the two limiting distributions agree, quantities computed from the shorter re-samples can be rescaled to the full-sample size, motivating the correction factor $\sqrt{\frac{M}{N}}$ applied to, e.g., the sample standard deviation.
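A sketch of the $M$ out of $N$ recipe for the sample mean, including the $\sqrt{M/N}$ rescaling (all names and sizes are illustrative):

```python
import math
import random
import statistics

def m_out_of_n_se(data, m, reps, seed=0):
    """Bootstrap SE of the sample mean from re-samples of size m < n,
    rescaled by sqrt(m/n) to approximate the full-sample SE."""
    rng = random.Random(seed)
    n = len(data)
    means = [statistics.mean(rng.choices(data, k=m)) for _ in range(reps)]
    return statistics.stdev(means) * math.sqrt(m / n)

rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(2000)]

# Quarter-size re-samples, rescaled back to the full-sample scale.
se_hat = m_out_of_n_se(data, m=500, reps=300)
```

For the sample mean the target is $\hat{\sigma}/\sqrt{N} \approx 1/\sqrt{2000} \approx 0.022$ here, and the rescaled quarter-size bootstrap should recover approximately that value.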

These arguments are all asymptotic and hold only in the limit $M, N \to \infty$. For this to work, it's important not to pick $M$ too small. There's some theory (e.g. Bickel & Sakov below) as to how to pick the optimal $M$ as a function of $N$ to get the best theoretical results, but in your case computational resources may be the deciding factor.

For some intuition: in many cases, we have $\hat{\mu}_N \overset{P}{\to} \mu$ as $N \to \infty$, so the statistic $$ \sqrt{N}(T_N(X_1, \ldots, X_N) - \mu) \tag{3} \label{m_out_of_n_intuition} $$ can be thought of a bit like an $m$ out of $n$ bootstrap with $m = N$ and $n = \infty$ (lower case is used here to avoid notational confusion). In this sense, emulating the distribution of (\ref{m_out_of_n_intuition}) using an $M$ out of $N$ bootstrap with $M < N$ is a more "right" thing to do than the traditional ($N$ out of $N$) kind. An added bonus in your case is that it is also less computationally expensive to evaluate.

I know of two good sources in case anyone wants more details on using bootstrap samples shorter than the original sample:

PJ Bickel, F Goetze, WR van Zwet. 1997. Resampling fewer than $n$ observations: gains, losses and remedies for losses. Statistica Sinica.

PJ Bickel, A Sakov. 2008. On the choice of $m$ in the $m$ out of $n$ bootstrap and confidence bounds for extrema. Statistica Sinica.

aph416
    Thanks for the contributions, aph416. Just FYI, if two questions are very similar, you can flag one of them as a duplicate of the other by clicking on 'flag' just beneath the question and selecting the appropriate option (I've done this now). This links the threads, which helps organize the questions and answers on this site, and also allows us to avoid duplicating effort. We also prefer that the same answer is not posted in response to multiple questions. – mkt Sep 12 '19 at 12:29
    @mkt Thanks for the info. I was planning on flagging as a duplicate, but I don't have enough reputation for this. Instead I mentioned it in my answer (also, this question has two parts and only the second is a duplicate). Thanks for the info again -- I'll keep this in mind next time. – aph416 Sep 12 '19 at 13:09