
Say I have the model,

$Y_i = B_0 + B_1 X_i + e_i$

and I have $N$ samples of size $k$. Let $X$ be a dummy variable.

In some samples, I have variation in $X$, and in some I don't (all $0$ or all $1$). For each sample, I attempt to estimate $B_0$ and $B_1$ using OLS. In samples where there is no variation, $B_1$ is ill-defined.

Assuming that the samples of size $k$ are drawn from the same underlying distribution, what would be the consequence for the distribution of the estimates of $B_0$ and $B_1$ if I discarded the samples without variation in $X$? It seems like it should be benign.

fgregg
  • I think that to get a more useful response, you should tell us about your real application! – kjetil b halvorsen Jan 26 '15 at 21:43
  • Fair point. http://stats.stackexchange.com/questions/135073/confidence-credible-intervals-for-parameter-estimates-from-structured-support-ve – fgregg Jan 26 '15 at 22:12

1 Answer


It will not be benign. Discarding those samples biases your estimates, because the retained samples make $X$ look more balanced than it really is. This effectively changes the distribution of your data.

For instance, suppose that $X=1$ in only one instance and $X=0$ everywhere else. Then the procedure you described will discard $N-1$ of your $N$ samples, which is basically equivalent to down-weighting the instances where $X=0$ by a factor of about $N$ (since only about $1/N$ of them will appear in the single retained sample). This is an extreme case, but the effect persists with less extreme numbers.
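To make the less extreme case concrete, here is a minimal simulation sketch. It assumes something slightly different from the example above: each $X_i$ is an independent Bernoulli draw with $P(X=1)=p$, and the function names, $p = 0.1$ and $k = 3$ are my own illustrative choices, not anything from your setup. It compares the share of $X=1$ among the retained samples with the true $p$.

```python
import random


def retained_share_of_ones(p, k, n_samples, seed=0):
    """Simulate n_samples samples of size k, drop those with no variation
    in X, and return the share of X = 1 among the retained observations."""
    rng = random.Random(seed)
    ones = 0
    total = 0
    for _ in range(n_samples):
        x = [1 if rng.random() < p else 0 for _ in range(k)]
        if 0 < sum(x) < k:  # keep only samples containing both 0s and 1s
            ones += sum(x)
            total += k
    return ones / total


def exact_retained_share(p, k):
    """Closed-form share of X = 1 among retained observations
    under the same i.i.d. Bernoulli(p) assumption."""
    p_keep = 1 - p**k - (1 - p) ** k   # P(sample has variation)
    expected_ones_kept = k * p - k * p**k  # E[count of ones; sample kept]
    return expected_ones_kept / (k * p_keep)


if __name__ == "__main__":
    p, k = 0.1, 3
    print("true share of X = 1:   ", p)
    print("simulated, retained:   ", retained_share_of_ones(p, k, 200_000))
    print("closed form, retained: ", exact_retained_share(p, k))
```

For $k = 3$ the closed form simplifies to $(1+p)/3$, so under these assumptions the retained data always show a share of $X=1$ between $1/3$ and $2/3$, whatever the true $p$ is.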

Assuming you have more than $N$ instances in each class of $X$, a better solution is to use stratified random sampling to ensure that each subsample has (roughly) the same class frequencies.
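A rough sketch of what such a stratified draw could look like is below. The function `stratified_subsample` and the `(x, y)` row layout are hypothetical, written only for illustration; this is not the API of any particular package. It uses proportional allocation, clamped so that each class contributes at least one row.

```python
import random


def stratified_subsample(rows, k, rng):
    """Draw k rows without replacement from rows of the form (x, y),
    x in {0, 1}, keeping the class shares of x roughly as in the full data.
    Each class is forced to contribute at least one row."""
    if k < 2:
        raise ValueError("need k >= 2 to include both classes")
    zeros = [row for row in rows if row[0] == 0]
    ones = [row for row in rows if row[0] == 1]
    if not zeros or not ones:
        raise ValueError("the data must contain both classes of x")

    # Proportional allocation for the x = 1 stratum, clamped to [1, k - 1].
    n_ones = round(k * len(ones) / len(rows))
    n_ones = min(max(n_ones, 1), k - 1)
    n_zeros = k - n_ones

    # random.Random.sample raises ValueError if a stratum is too small.
    return rng.sample(zeros, n_zeros) + rng.sample(ones, n_ones)


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(1, 1.0)] * 10 + [(0, 0.0)] * 90  # 10% of rows have x = 1
    sub = stratified_subsample(data, 10, rng)
    print(sum(x for x, _ in sub), "of", len(sub), "rows have x = 1")
```

Note that when $k$ is small relative to the class imbalance, forcing at least one row per class still over-represents the rare class, so you would need to reweight if you care about matching the population frequencies.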

Ben Kuhn
  • I see what you mean. Do you know of any references which discuss stratification for subsampling? – fgregg Jan 26 '15 at 19:35
  • The R [sampling](http://cran.r-project.org/web/packages/sampling/sampling.pdf) package may be what you're looking for. Weirdly, I can't find a great exposition of it, but [Wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) and the [Food and Agriculture Organization](http://www.fao.org/docrep/x5684e/x5684e04.htm) have decent ones (if you can get past the extremely fish-themed examples in the latter...) – Ben Kuhn Jan 26 '15 at 19:58
  • See Cochran's Sampling Techniques, 1977 or Leslie Kish's "Survey Sampling." Any basic sampling book should discuss stratified random sampling. – StatsStudent Jan 26 '15 at 20:00
  • I understand stratified sampling, but I'm looking for references on stratification for [subsampling](http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/ebooks/html/csa/node132.html) – fgregg Jan 26 '15 at 20:09